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Mathematical  models  to  describe  air  pollution  are  necessary  planning 
tools  in  the  air  quality  management  process.   They  provide  a  means  of 
predicting  and  controlling  pollution.   Classical  models  are  necessary 
in  the  cibsence  of  reliable  ambient  measurements.   These  models  have 
become  increasingly  sophisticated,  but  they  remain  inadequate  to  des- 
cribe many  of  the  phenomena  which  actually  occur  in  the  atmosphere. 
Statistical  methods  applied  to  reliable  air  quality  influence  parameters 
can  advance  the  science  of  modeling,  leading  to  stochastic  models  to 
simulate  and  better  understand  air  pollut_  i.   Ultimately,  these  tech- 
niques will  provide  capabilities  for  real-time  analysis  and  control  of 
air  pollution. 

Statistical  models  for  air  pollution  have  generally  been  limited  to 
specification  of  distribution  functions  or  analysis  of  orthogonal  com- 
ponents because  dynamic  statistical  models  are  complicated  under  the 
non- stationary  conditions  encountered  in  the  atmosphere.   This  disser- 
tation investigates  the  application  of  pattern  recognition  techniques 
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to  identify  quasi-stationary  conditions  with  respect  to  both  space  and 
time.  Systems  models  approximating  these  quasi-stationary  groupings 
can  be  constructed  when  adequate,  influence  data  become  available.  An 
analysis  of  raesoscale  meteorological  conditions  was  used  to  illustrate 
how  influence  parsuneters  can  be  identified. 

Photochemical  oxidants  are  a  national  concern.  The  primary  consti- 
tuent is  ozone  which  can  be  accurately  measured  using  a  chemilumines- 
cent  technique-   During  1974,  the  Tampa,  Florida,  area  experienced  many 
elevated  oxidant  periods  which  were  documented  on  well-calibrated  con- 
tinuous monitors.  These  data  were  therefore  selected  to  determine 
conditions  under  which  stochastic  air  pollution  models  could  be  devel- 
oped. 

The  multivariate  approach  developed  in  this  dissertation  for  iden- 
tifying similar  ozone  waveforms  is  a  powerful  analytical  tool.   The 
method  is  not  restricted  to  air  pollution  research,  but  extensive  ap- 
plications in  atmospheric  studies  are  appropriate.   Perhaps  the  most  useful 
aspect  of  the  method  is  the  ability  to  identify  conditions  which  facil- 
itate model  development.   The  most  frequently  encountered  condition  in 
this  study  was  designated  the  non-episode  type.  For  these  days,  the 
oxidant  standard  was  rarely  exceeded;  most  sampling  sites  recorded  the 
same  type  of  waveform;  aind  many  consecutive  days  reported  this  condi- 
tion.  These  days  would  be  relatively  easy  to  model  since  mesoscale 
parameters  would  generally  be  adequate,  many  examples  of  these  condi- 
tions are  available  and  variations  between  days  in  this  group  are  small. 
The  next  most  frequently  occurring  condition  was  called  eui  episode  day 
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since  the  oxidant  stcuidard  was  often  exceeded,  but  ordinarily  for  only 
one  or  two  hours  in  each  day.  Spatial  differences  become  increasingly 
in?)ortant  for  this  group  indicating  a  need  for  microscale  data  in  some 
instances.  More  variance  was  encountered  and  many  different  waveforms 
were  represented.  A  tMrd  type  of  day  was  designated  super  episode 
since  the  oxidant  standard  was  usually  exceeded  and   elevated  levels 
persisted  for  several  hours.  Super  episode  conditions  often  followed 
frontal  passage  under  high  pressure  conditions  when  solar  insolation 
was  very  high.  Many  of  these  days  exhibited  rapid  change  in  concentra- 
tion and  radically  differing  waveforms.  Spatial  differences  were  also 
significant  on  many  of  these  occasions.  Consequently,  super  episode 
conditions  would  be  most  difficult  to  model  and  would  require  a  dense 
network  providing  microscale  influence  data  and  frequent  sampling. 
There  were,  however,  a  number  of  frequently  repeated  spatially  homo- 
geneous conditions  within  the  super  episode  group.  These  would  be  good 
candidates  for  stochastic  modeling. 
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CHAPTER  I  -  INTRODUCTION 

There  are  many  approaches  to  predicting  air  quality  in  today's  society. 
This  is  natural  since  the  problem  is  indeed  complicated.  As  discussed 
in  Gifford  (16, p. 84),  the  most  general  equation  describing  pollutant 
transport  is  a  hyperbolic  partial  differential  equation.   Thers  is 
actually  no  known  closed-form  solution  for  this  equation.   Consequently, 
different  methods  of  finding  approximate  solutions  have  evolved.  The 
two  basic  classes  of  methods  are  called  a  priori  (before  the  fact)  and 
a  posteriori  (after  the  fact) .   A  priori  methods  concentrate  on  a  know- 
ledge of  the  physics  involved  to  make  appropriate  assumptions  and  to 
establish  boundary  conditions.   Similarity  variables,  separation  of 
variables  and  integral  transform  methods  are  then  used  to  obtain  ana- 
lytical solutions.   A  posteriori  mediods  operate  on  obsarved  variables 
as  they  change  with  time  to  develop  functional  relationships  between 
influence  parameters  (inputs)  and  the  pollutant  modeled  (output).  Beth 
types  of  models  are  useful  tools  for  understanding  the  causes  and  matii- 
cds  of  controlling  air  pollution. 

There  are  many  kinds  of  a  priori  models.   The  simplest  type  uses  a 
single  analytical  equation  for  which  various  combinations  of  "steady" 
input  parameters  yield  concentration  as  a  function  of  distance  from 
the  sources.   They  are  divided  into  two  types  called  long-term  and 
short-term  models.   Both  use  the  same  equation,  but  the  manner  of 


treating  the  output  concentration  differs.   Long-term  models  weight 
the  result  from  each  set  of  conditions  by  an  empirically  determined 
frequency  of  occurrence  for  that  condition.  The  sura  of  all  such  con- 
centrations represents  a  long  term  average  pollutant  level  as  a  f\inc- 
tion  of  space.  Isopleth  contoiir  maps  may  then  be  plotted.   Short-term 
nodels  provide  down-wind  steady-state  concentrations.   The  averaging 
time  for  these  results  is  assumed  to  be  on  the  order  of  10  minutes, 
and  longer  averaging  times  (30  min.,  1  hour,  etc)  are  estimated  empir- 
ically by  considering  meander  in  the  unsteady  wind.   Sophisticated  a 
priori  models  use  numerical  integration  techniques.   Forward  difference, 
central  difference,  backward  difference  and  coiT±)inations  are  used  with 
a  moving  (Lagrangian)  coordinate  system,  as  well  as  the  more  common 
fixed  (Eulerian)  coordinate  system.  Generally,  long-term  models  pro- 
vide better  estimates  since  averages  are  easier  to  predict  than  specific 
conditions.   In  either  case,  the  model  performance  is  evaluated  by  com- 
paring model  projections  with  observed  air  quality  iinder  the  assumed 
conditions.   The  practice  is  known  as  calibration.   Millions  of  dollars 
have  been  spent  to  develop  a  priori  models  in  recent  years. 

The  two  types  of  a  posteriori  models  are  loosely  referred  to  as  static 
and  dynamic.  Static  models  focus  on  the  underlying  set  of  distributions 
which  characterize  observed  air  quality.   Most  popular  of  these  is  the 
lognormal  family.   Two  kinds  of  questions  are  answered  by  such  models. 
The  first  is  the  empirical  probability  of  observing  a  value  exceeding 
some  fixed  limit,  presumably  an  air  quality  standard,  during  the  year 


sampled.   The  second  issue  to  be  resolved  is  the  probability  that  this 
limit  would  be  exceeded  in  subsequent  years.   This  type  of  model  is 
directly  related  to  air  quality  standards  and  it  therefore  has  been 
used  and  abused  extensively. 

Dynamic  statistical  (stochastic)  models  have  not  been  thoroughly  inves- 
tigated.  There  are  many  reasons  which  explain  this  historical  lack  of 
interest  in  stochastic  systems  for  air  pollution.  A  mathematical  basis 
in  terms  of  the  physical  process  provides  a  framework  in  which  to 
understand  pollutant  dispersion.  Consequently,  there  is  an  inherent 
bias  toward  physical  models.   The  theoretist  will  resort  to  empirical 
techniques  only  after  he  has  thoroughly  exhausted  his  analytical  ave- 
nues.  Empiricists  ultimately  attempt  to  relate  conclusions  to  physical 
laws  to  facilitate  interpretation  of  results.   Naturally,  a  priori 
models  are  developed  first.   Attention  turns  to  empirical  models  when 
the  physical  models  become  inadequate  or  when  additional  viewpoints 
become  necessary.   Moreover,  in  air  pollution,  the  inaccuracy  of  field 
measurements  becomes  extremely  important  since  trace  concentrations 
must  be  studied.   Surface  chemistry,  molecular  interaction  and  inter- 
ference become  serious  problems  which  undermine  the  validity  of  data. 
Empirical  models  are  limited  by  the  axiom  "garbage  in  yields  garbage 
out,"   Consequently,  these  models  cannot  really  be  developed  until 
reliable  aerometric  data  are  available. 

In  recent  years  it  has  become  increasingly  clear  that  the  dispersion  of 
air  pollutants  is  a  stochastic  phenomenon.   Realistically,  a  simple 


analytical  solution  cannot  provide  reliable  short-term  solutions  under 
conditions  normally  encountered  in  the  atmosphere.   Supporters  of  a 
priori  models  have  argued  that  stochastic  models  would  be  applicable 
only  to  the  region  for  which  they  were  developed.   However,  the  same 
factors  which  contribute  to  regional  applicability  tend  to  violate  the 
assumptions  required  in  physical  models.   Consequently,  as  reliable 
continuous  air  quality  data  become  more  available,  interest  in  stochas- 
tic models  will  increase. 

Perhaps  the  most  troublesome  aspect  of  air  pollution  modeling  is  the 
nonstationary  character  of  the  atmospheric  transport  process.  Simply 
stated,  this  means  that  the  system  being  modeled  changer-j  with  time. 
Systems  engineers  have  coped  with  this  kind  of  problem  by  developing 
adaptive  filters.   However,  a  systematic  basis  for  adjusting  the  filter 
coefficients  is  required.   In  this  dissertation,  the  nonstationary 
character  exhibited  by  ozone  data  reported  in  Hillsborough  and  Pinellas 
counties  in  Florida  is  investigated.   These  data  were  selected  due  to 
the  national  importance  of  the  photochemical  oxidant  problem  and  the 
availability  of  rigorously  verified  data.  The  objective  is  to  classify 
reported  episodes  into  quasi-stationary  intervals  using  techniques  from 
pattern  recognition.  These  constitute  a  basis  for  multiple  stochastic 
model  development  and  also  prove  to  be  a  useful  diagnostic  tool  for 
understanding  photochemical  air  pollution. 


CHAPTER  II 
BACKGROUND  AND  PROBLEM  STATEI>IENT 

The  Study  of  air  pollution  as  a  function  of  space  and  time  has  long 
been  a  human  concern.  Pollutant  dispersion  is  described  by  differen- 
tial equations-   Mathematical  techniques  for  solving  these  equations 
have  historically  been  adapted  from  the  study  of  heat  transport.   This 
is  primarily  due  to  the  comprehensive  analytical  solutions  developed 
for  various  boundary  conditions  and  input  functions.  Atmospheric  sys- 
tems contain  random  variables  and  random  functions  of  these  variables. 
Differential  equations  for  random  functions  are  called  stochastic  pro- 
cesses and  they  generally  do  not  have  analytical  solutions.  An  ap- 
proach, which  avoids  steady-state  analytical  solutions,  is  numerical 
integration.  The  most  advanced  air  pollution  models  in  today's  society 
use  this  technique. 

Regardless  of  the  methods  used  to  model  a  process,  comparison  with 
measured  results  is  essential.  Empirical  analysis  of  these  measured 
values  can  provide  insight  concerning  the  functional  relationships 
between  influence  parameters  and  pollutant  concentrations  in  addition 
to  validating  models.  Thorough  statistical  analysis  of  observed  con- 
centrations, viewed  as  non-stationary  dynamic  systems,  leads  to  the 
development  of  stochastic  dispersion  models.  This  system  theory  ap- 
proach to  air  pol].ution  is  investigated  in  this  dissertation. 
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Brownian  motion  provides  a  starting  point  for  understanding  atmospheric 
stochastic  processes.  In  a  paper  by  Einstein  (12)  the  irregular  motion 
of  colloidal  particles  based  on  collisions  with  surrounding  fluid  mole- 
cules was  described.  A  thorough  treatment  of  Brownian  motion  is  given 
in  Fuchs  (14) ,  showing  that  the  velocity  history  of  a  colloidal  parti- 
cle is  a  random  (stochastic)  process.  The  equation  describing  Brownian 
motion  is 


dU(t)  ^  {j(t)  =  [A]U(t)+rB]W(t)  2.1 

dt 


where 


_  =  total  derivative  with  respect  to  time 
dt 

U(t)  =  particle  velocity  vector  at  time  t(a  state  variable) 

[A]  =  -S[Il  the  system  matrix  with: 
g  =  6iTaij/m  (particle  mobility) 

a  =  radius  of  spherical  particle 
\i   =  fluid  viscosity 
m  =  partical  mass 
[I]  =  identity  matrix 

[B]  =  distribution  matrix  relating  input  variables 

to  state  variables 

W(t)  =  input  vector  of  white  noise 

representing  molecular  collisions. 

Equations  of  this  form  were  first  treated  by  Langevin  (18)  ,  who  hypo- 
thesized two  additive  components.   The  first  component,  ([A]U(t)),  is 
the  systematic  dynamic  friction  (drag).   The  second  component,  ([BlW(t)), 
represents  the  random  force  exerted  by  molecular  collisions.   Brownian 


motion  is  also  known  a^s  the  Wi       -ener  process  after  Norbert  Wiener    (48) 
who  formulated  much  of        the  pre       sent  theory  of  time  series. 


Integrating  Langevin's        equatio      31  yields 

U(t)    =  e^^U(o)+— eAtjJe" ^^BW(X)dA 

where 

eAt  =  exp(At)  d  the  mat  i^rix  exponential) 
U(o)    =  velocity^-   vector  at  time  zero;    and 

X  =  X-t,    th^==    lag  pa ^craiaeter. 


2.2 


From  statistical  mechan=_ics  we  i:^=<now  that  the  kinetic  energy  from  the 
ui,   U2,   U3    (x,    y  or   z)         componenrsnt  of   the  Brownian  particle   is 

»im<u?(t)>   =  kkT  2.3 

where 

2 
<u£(t)>  =  e=^  usemble average  squared  velocity  component 

k  =  Bi^-oltzmanjM—  i '  s   constant 

T  =  a — iisolute     temperature. 

Since  uf  (t)  is  not  a  fu,:i_j[iction  (   -)f  time,  the  process  is  stationary,  having 
a  time  invariant  autoco-  j:relati<   _)n  function.  This  autocorrelation  func- 
tion, defined  as 

.     .  <     -U1  (t)     U 5f   (t+T)> 

2.4 


^,    ,  <    -ui(t)    u ^  (t+T)> 

R(T)    =  — 


<u2(tZZ>  > 
can  easily  be  determine-  <a  from  -^^squation  2.2. 

For  Brownian  motion,    th_  «  ensemi- )le  average  of  all  times  containing  the 


input  vector  -. ■■ ;  c)  is  zero .  Thus  only  the  drag  component  enters  into 
this  equation  resulting  in 

„,  ,    <U(o)  U(T)>    ,T    _8t. 

R(T  =  = =  e^^  =  e  P^  2.5 

<u2(t)> 

This  is  the  autocorrelation  function  of  a  Markov  process,  characterized 
in  Arnold  (2,  p. 27)  by  the  property  "if  the  state  of  a  system  at  the 
present  is  knovm,  information  from  past  times  has  no  effect  on  our 
knowledge  of  the  probable  development  of  the  system  in  the  future." 
In  other  words,  the  system  has  no  memory. 

The  work  of  Taylor  (45)  provides  some  very  important  results  in  the 
field  of  turbulent  diffusion  theory.   Taylor's  theorem  relates  the 
expected  particle  displacement  of  a  diffusing  cloud  to  the  velocity 
autocorrelation  function.  For  an  initially  concentrated  cloud  (instan- 
taneous point  source)  Taylor's  theorem  is 

^^  =  2<u'(t)>  i;-R(T)dT  2.6 


'l^^^>^> 


dt 

Here  a^   is  the  radius  of  inertia,  a  measure  of  cloud  size.  Substituting 
the  autocorrelation  for  Brownian  motion  and  integrating  this  equation 
yields 

Ojj^  =  2<ljl(t)>\^   -i2(l-e"'^'-)|  2.7 


[ft'-^-i 


For  unit  density  spheres  with  radii  smaller  than  10  micrometers,  the 
relaxation  time  (Q~^)    is  less  than  0.01  seconds.  Thus,  for  time  on 
the  order  of  seconds. 


„  2   2<uf(t)>t 

"X  =    J-      =  2Dt  2.8 

8 

where 

kT 
D  = (tJie  diffusivity  constant)  . 


The  equipartition  of  energy  principle  from  statistical  mechcinics  and 
the  integrated  Langevin's  equation,  together  with  Taylor's  theorem, 
yield  Einstein's  conclusion  that  Brownian  motion  can  be  simply  defined 
using  a  diffusivity  constant. 

The  classical  diffusion  equation  representing  Brownian  motion  is  an 
instantaneous  point  source  with  no  bulk  fluid  motion.  Thus, 

l£  =  DV^c  =  D  1^  +  lie  +  ^  2.9 

3^  3x2    ay2   3z2 

where 

3  [-1 

TL —  =  partial  derivative  with  respect  to  time 

7  =  the  Laplacian  operator 
c  =  concentration  of  diffusing  substance 
D  =  a  constant  diffusivity. 

Analytical  solutions  for  this  equation,  and  for  augmented  equations 
representing  atmospheric  conditions,  are  abundant  in  the  literature. 
Two  particularly  readable  and  comprehensive  summaries  are  Slade  (41) 
and  Csanady  (7).  For  the  isotropic  conditions  ia^=ay=a2)    exhibited 
in  Brownian  motion,  the  analytical  solution  is 
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c(r,t)  =  Q(2Tro2)-V2exp[(-r2)/2a2)]  2.10 

This  is  the  cumulative  distribution  function  for  the  Gaussian,  or 

normal,  probability  law 

where 

Q  =  source  strength 

r^  =  (x^+y2+z^)  -  squared  radius  of  diffusing  cloud 

a^  =  2  Dt. 

The  diffusivity  required  to  satisfy  equation  2.10  is  the  same  diffu- 
sivity  derived  for  the  stochastic  formulation  of  Brownian  motion  given 
in  equation  (2.8).   Thus,  the  stochastic  and  analytical  approaches 
yield  the  same  results  for  the  special  conditions  of  Brownian  motion. 

For  n  chemically  reactive  constituents,  the  continuity  equation  is 

3Ci     (uiCi)     (u2Ci)     (u3Ci)     „2 

W      dr~  ^  iT"  "*■  §r~  "     i  ■*■  %(Ci,...,Cn,T)+Si(x,y,z,t) 

2.11 
where 

n  =  number  of  pollutants  involved 
C.  =  concentration  of  constituent  i 
{Ui,U2/U3)  =  (x,y,z)  components  of  wind  velocity 

R^  =  rate  of  formation  of  constituent  i  by  chemical  reaction 
Sj_  =  rate  of  emission  of  constituent  i  from  sources. 
In  practice,  only  n-1  equations  eire  required  since  the  remaining  con- 
stituent can  be  determined  by  difference. 
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In  modeling  air  pollution,  one  can  usually  assume  that  the  pollutants 
are  merely  tracers  of  atmospheric  activity.  Under  this  assumption 
equation  2.11  can  be  solved  alone  without  considering  the  coupled 
Navier-Stokes  (27)  and  energy  equations.  Now,  each  wind  velocity  com- 
ponent consists  of  a  deterministic  and  a  stochastic  component  (U'=u-+u') 
Similarly,  each  C^  consists  of  an  ensemble- average  concentration  and  a 
random  component  (Cj_=<C^>+c|)  .   Ordinarily,  molecular  diffusion  is 
negligible  compared  to  turbulent  dispersion.   Introducing  these  con- 
siderations and  taking  the  expected  value  of  equation  2.11  yields  the 
"K-theory"  coupled  partial  differential  equations  presented  in  earlier 
dispersion  modeling  studies;  for  example,  see  Pasquill  (30)  and  Slade 
(41) .   The  general  form  of  this  equation  is 


3<Ci>        3       _  3  3       _ 

-3t~  ■*■  ^  [^'l<Ci>l    +  g^  [TT2<Ci>]    +  ^  [U3<Ci>l    = 


|^[K^|^<Ci>]     +1^      [Kyy|^<Ci>]     +  |^  [  Kzz|^  <Ci>]     + 


^i<^^j''j=l,n'^)    •*■  Si(x,y,z,t)  2.12 


where 


^aa    ^^  =  ~'^U4'"i^'  multiplicative  stochastic  diffusivity 
3  components  and 

Kgij  =  0.0  is  assumed  for  a  7^  b. 


Analytical  solutions  for  Ci(x,y,z,t)  are  derived  by  making  assumptions 
in  the  above  equation  and  applying  classical  mathematics  [Carslaw  and 
Jaeger  (  6 )  ]  • 
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Unfortunately,  dispersion  modeling  is  complicated  by  "eddy  diffusion" 
when  random  motions  of  the  transporting  fluid  enter  the  equation. 
Equation  2.12  is  analogous  to  equation  2.9  with  "eddy  diffusivities" 
(Rs)  in  place  of  molecular  diffusivity  (D) .  This  difference  can  be 
interpreted  in  terms  of  the  "scale  of  turbulence."  For  Brownian  motion 
the  length  scale  is  the  mean  free  path  and  the  time  scale  (t^)  is  6"-^. 
For  turbulent  flow  there  is  no  single  time  or  length  scale.   Turbulent 
flow  in  the  atmosphere  is  characterized  by  eddies  of  vastly  different 
sizes.  Large  eddies  extract  energy  from  the  bulk  flow.   Smaller  eddies 
feed  on  the  energy  of  larger  eddies  resulting  in  an  "energy  cascade." 
The  eddy  energy  comes  from  the  bulk  fluid  energy  augmented  by  solar 
insolation ,  surface  heat  and  topographic  factors .  These  eddies  move 
groups  of  particles  together,  so  the  motion  of  neighboring  particles 
is  not  independent.   This  correlation  of  motion  over  many  size  scales 
causes  unpredictable  differences  such  that  the  obseirved  concentrations 
are  random  variables.  Further,  eddy  diffusivities  are  not  constant  in 
space  or  time.   Under  these  conditions,  there  is  no  analytical  solution 
to  equation  2.12.   Many  researchers  have  turned  to  numerical  integration 
to  overcome  this  limitation  [Reynolds  et  al.  (36)  and  (37)  ;  California 
Air  Resources  Board  (CARS) (  5  ) ;  and  Randerson  (35)] . 

The  status  of  recent  air  pollution  modeling  is  svunmarized  in  NATO/CCMS 
(26) .   The  advantages  and  limitations  of  both  classical  and  statistical 
models  are  presented  in  a  complementary  context.   Advanced  classical 
models  require  extensive  specification  of  wind  fields,  diffusivities 
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and  constituent  concentrations  as  functions  of  space  and  time.   Problems 
with  these  models  include  large  computer  time  and  storage  requirements, 
stability  of  integration  schemes,  "artificial  diffusion,"  and  cumulative 
errors  (CARB  (5)).   In  some  situations,  simple  models  yield  better 
results  than  the  advanced  models,  as  demonstrated  by  Gifford  in  NATO/ 
CCMS  (15, p. XVI).   Statistical  air  pollution  models  have  not  been  as 
extensively  developed  as  classical  models.   Very  fev/  dynamic  statistical 
models  have  been  formulated.  The  primary  reasons  for  this  are  avail- 
ability of  reliable  air  quality  data  from  dense  sampling  networks;  a 
lack  of  regional  appliccibility  of  the  models  developed;  mathematical 
complications;  and  most  importantly,  the  non-stationary  character  of 
air  pollution.  The  chronology  of  the  NATO  reports  clearly  illustrates 
tlie  increasing  emphasis  on  statistical  aspects  in  modern  air  quality 
simulation. 

A  stationary  process  nas  an  autocorrelation  function  which  does  not 
change  with  time.  Second  order  stationarity  implies  that  the  mean 
and  variance  of  repeated  experiments  are  constant.   Air  pollutant  dis- 
tributions cire  not  stationary,  primarily  due  to  changes  in  meterological 
conditions.   This  non- stationary  character  of  atmospheric  turbulence 
was  treated  by  Pasquill  (30) -  He  devised  "stability  classes"  to  des- 
cribe turbulence  types.   This  stationarity  concept  is  closely  coupled 
to  the  duration  of  an  experiment  and  the  averaging  time  of  samples. 
Although  diurnal  variation  denies  stationarity  with  respect  to  24  con- 
secutive hour-long  experiments,  day-long  experiments  may  be  stationary 
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with  respect  to  one  another.   Similarly,  major  changes  in  meterology, 
such  as  storms,  or  frontal  passage,  disrupt  stationary  conditions. 

As  pointed  out  by  Pasquill  (30, p. 17),  it  is  this  non-stationary  nature 
of  atmospheric  processes  which  limits  the  use  of  time-series  techniques, 
Selecting  subgroups  of  data,  which  are  stationary  with  respect  to  one 
another,  provides  a  means  of  developing  dynamic  statistical  models  for 
air  pollution.  Stochastic  models  offer  some  very  attractive  qualities. 
Since  their  random  dynamic  nature  is  analogous  to  observed  atmospheric 
conditions,  they  should  prove  superior  to  the  steady-state  Gaussian 
models  popularized  in  the  recent  past. 


CHAPTER  III 
PATTERN  RECOGNITION  TO  IDENTIFY  STATIONARY  INTERVALS 

Selecting  reproducible  conditions  in  air  pollution  can  be  viewed  as  a 
problem  in  pattern  recognition.  Many  methods  which  accomplish  this 
objective  can  be  formulated.  A  good  method  must  operate  on  readily 
available  data  to  delineate  similar  conditions  at  reasonable  cost.  One 
could  begin  by  plotting  the  observed  concentration  as  a  function  of  time 
and  studying  the  graph  to  pick  piecewise-similar  days.   Unfortunately, 
conditions  in  the  atmosphere  usually  change  in  rapid  and  random  fashion, 
as  when  fronts  pass  or  reacting  contaminants  interact.   Consequently, 
randomly  repeated  intervals  occur  which  are  not  ordinarily  sequential. 
An  additional  complication  concerns  differences  in  waveforms  occiorring 
at  the  same  time,  but  at  different  spatial  locations.   In  order  to  apply 
stochastic  modeling  techniques  to  air  pollution,  methods  for  identifying 
similar  conditions  randomly  distributed  through  space  and  time  are  nec- 
essary. 

Statistical  studies  of  time  behavior  traditionally  deal  with  time  cor- 
relation.  The  correlation  of  a  signal  with  itself  is  known  as  the  auto- 
correlation fiuiction  (p(t)).  Tau  (t)  is  a  time  function  representing 
lag  or  lead  from  zero  time  toward  past  or  future  time.   For  example 
p (o) ,  the  autocorrelation  of  zero  time,  is  computed  from  the  covariance 

of  a  sequence  of  niimbers.   For  a  zero  mean  sequence,  this  is  the  sum  of 
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all  squares  divided  by  the  number  of  observations.  At  each  lag  (for 
example  10  hours)  a  new  covariance  can  be  estimated  by  summing  the  pro- 
duct of  all  paired  obseirvations  T(ten)  hours  apart.  Dividing  each  of 
these  covariances  by  the  covariance  at  zero  lag  yields  the  sample  auto- 
correlation function.  By  analogy  two  signals  can  be  related  by  com- 
puting cross  products  as  one  signal  is  slid  past  the  other,  yielding 
the  crosscorrelation  function.  These  two  fvinctions  are  used  by  statis- 
ticians to  characterize  time  behavior.  When  the  relationship  between 
the  autocorrelation  and  crosscorrelation  functions  does  not  change  the 
system  is  called  time  invariant  or  stationary.  However,  when  this  rela- 
tionship changes  with  time ,  as  it  does  in  air  pollution  studies ,  multiple 
inputs  cuid  non-stationary  methods  are  essential. 

Engineers  are  usually  exposed  to  the  concept  of  frequency  response  in 
the  study  of  system  control.  The  way  a  system  reacts  to  changes  in 
input  is  described  in  terms  of  transfer  functions,  having  poles,  zeros 
and  transforms  (Laplace  and  Foxorier)  relating  time  to  frequency. 
Theoretically,  the  Fourier  transform  of  a  signal  and  the  Fourier  trans- 
form of  the  autocorrelation  function  are  directly  related  through  the 
power  spectral  density  function  as  discussed  in  Weiner  (48) -  Consequently, 
systems  having  the  same  correlation  functions  should  exhibit  the  same 
frequency  response.   Identifying  intervals  having  the  same  frequency 
response  specifies  a  quasi- stationaary  system.   This  approach,  using 
very  efficient  algorithms  to  compute  sliding  fast  Fourier  transforms 
(FFT)  as  in  Rabiner  and  Gold  (34, p. 382),  is  a  powerful  analytical 
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technique.  It  provides  a  way  to  compare  any  interval  of  time  with  any 
other  interval  based  on  the  system  response-  However,  this  direct 
approach  to  identifying  stationarity  is  expensive  and  does  not  take 
advantage  of  the  known  physical  processes  involved. 

One  of  the  most  significant  factors  in  atmospheric  studies  is  solar 
insolation.  The  sun  is  a  primary  driving  force  behind  meteorology. 
Wind  speed,  wind  direction  and  atmospheric  turbulance  are  related  to 
the  intensity  and  duration  of  incident  radiation  as  well  as  other  major 
physical  phenomena.  Since  the  period  of  rotation  of  the  earth  is  one 
day,  it  is  reasonable  to  divide  a  continuous  record  into  24  hour>-long 
intervals.   Longer  periods  can  be  selected  based  on  frontal  passage 
or  storm  cycles  as  appropriate.  These  periods  can  be  fvirther  sub- 
divided into  wind  direction  or  turbulence  types  as  is  currently  prac- 
ticed in  dispersion  modeling.   Differences  between  days  may  then  be 
viewed  as  multivariate  problems.  The  hour  of  day,  for  example,  can  be 
a  variable  and  each  day  would  then  be  represented  as  a  2-. -space  vector. 
In  this  way,  the  relationship  between  time  and  other  variables  can  be 
analyzed  using  classical  multivariate  techniques.   A  multivariate 
approach  to  identify  similar  patterns  of  air  pollution  is  presented 
in  this  chapter. 

A  fundamental  concept  in  multivariate  analysis  is  the  covariance  matrix. 
For  a  collection  of  days  having  24  hourly  averages,  the  coveiriance 
matrix  has  24  rows  and  24  columns  and  the  values  in  the  upper  right 
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triangle  are  equal  to  those  in  the  lower  left  triangle.  Each  diagonal 
element  is  the  sioin  of  squares  for  a  particular  hotar  divided  by  one  less 
than  the  total  number  of  days.   The  square  root  of  each  such  covariance 
is  the  standard  deviation  for  that  particular  hovu:.  The  off-diagonal 
values  represent  correlations  between  variables  and  are  cross  products 
of  pairs  representing  both  hoiars  summed  over  all  days  and  divided  by 
the  number  of  days.   Thus,  the  total  variance  of  the  matrix  is  the  sum 
of  its  diagonal  elements  and  the  intercorrelation  between  each  hour  and 
all  other  hours  is  represented  within  the  matrix.  Often  the  mean  for 
each  hour  is  first  stibtracted  from  the  raw  data,  resulting  in  a  covar- 
iance matrix  about  the  mean. 

By  mathematically  rearranging  the  covaricince  matrix,  a  concise  repre- 
sentation of  each  original  day  can  be  computed  using  only  a  few  new 
variables  rather  than  the  original  24.  This  powerful  technique,  dis- 
cussed in  detail  in  Morrison  (25)  and  in  Harmon  (17) ,  is  known  as  factor 
analysis.   The  first  step,  called  principal  component  analysis,  finds 
new  variables  made  up  of  proportionate  weights  of  each  of  the  original 
variables.  The  contributions  of  each  hour  are  assigned  in  a  way  which 
accounts  for  as  much  of  the  difference  between  individual  days  and  the 
"average  day"  as  possible.   The  second  component,  or  weighting  fvinction, 
is  constructed  perpendicular  to  the  first  component  in  a  way  which 
accounts  for  as  much  of  the  remaining  variance  as  possible.  Subsequent 
components  can  be  constructed  which  are  perpendicular  to  the  previous 
components  until,  finally,  all  of  the  variation  is  accounted  for.   The 
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analyst  can  then  select  the  minimum  number  of  components  to  recover  any 
desired  percentage  of  the  original  information.  Factor  scores  can  then 
be  computed  for  each  day.   These  factor  scores  determine  the  amount  of 
each  component  to  be  added  to  the  average  day  (mean  vector)  to  recon- 
struct the  original  hourly  observations.  The  factor  weights  are  added 
to  the  mean  vector  because  the  covariance  matrix,  from  which  the  eigen- 
vectors were  derived,  was  calculated  about  the  24  hourly  mean  values. 
The  object  is  to  reduce  the  dimensions  of  the  problem  by  projecting  the 
original  information  into  a  set  of  orthogonal  functions.  These  ortho- 
gonal ftinctions  (eigenvectors)  are  ranked  in  order  of  their  relative 
importance.  Thus,  the  original  data  can  be  represented  as  a  few  inde- 
pendent variables  vrfiile  retaining  most  of  the  information.  An  applica- 
tion of  principal  components  to  air  pollution  appears  in  Peterson  (32) 
and  a  particularly  readable  description  of  principal  component  analysis 
can  be  found  in  Stidd  (44) . 

This  procedure  can  be  succinctly  summarized  in  the  shorthand  of  matrix 
algebra.   A  linear  model  of  the  form 

[XI  =  [FHElT  +  M+  [e]  3.1 

is  to  be  constructed  where: 

[X]   =  estimate  matrix  reconstructing  original  data 

[F]  =  factor  score  matrix  representing  each  day 

[E]  =  factor  loading  matrix  of  weighting  fvmctions  for  each 
independent  factor 

T  =  trcuispose  operator 

[e  ]  =  error  matrix  or  difference  between  observed 
and  predicted  values;  and 
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M  =  mean  vector  or  average  day 
The  first  step  is  to  find  the  eigenvalues  (see  Appendix  II)  of  the 
singular  system. 

det  [[Xl^fxl  -  [I]X]  =  0  3.2 

where: 

det  [ • ]  =  determinant  of  matrix  [ • 1 
[x]^[X]  =  normalized  covariance  matrix  [gov] 
[l]  =  identity  matrix;  and 
X  =  eigenvalues  (24  in  this  application). 

Note  that  the  siam  of  the  eigenvalues  over  all  24  variables  equals  the 
total  variance  in  the  covariance  matrix.  Only  the  first  few  eigen- 
values are  retained  so  that  their  sum  represents  the  desired  percentage 
of  this  total  variance.  Eigenvectors  are  computed  for  each  eigenvalue 
retciined  so  that 

[covl  Ei  =  XiEi  3.3 

The  factor  loading  matrix  ([E])  has  a  24-valued  weighting  function 
representing  each  column  of  the  matrix-  These  are  the  independent 
factors  described  above.   Factor  scores  indicating  the  contribution 
of  each  factor  required  to  reconstruct  each  day  are  computed  from 

[F]  =  [El'^[covr-'-[X]  3.4 

where : 

[F]  =  factor  scores,  a  column  vector  for  each 
day ;  and 

[X]  =  obseirved  concentration  minus  the  mean 
for  each  hour 
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These  factor  scores  have  zero  mean  and  unit  variance  con^Juted  over  all 
days.  The  exainple  which  completes  this  chapter  illustrates  these  com- 
putations . 

Pattern  recognition  techniques  for  multivariate  systems  are  known  as 
cluster  analysis.  An  excellent  introduction  to  this  subject  is  Duran 
and  Odell  (11) .   Specific  details  of  various  schemes  may  be  found  by 
pursuing  their  extensive  reference  list.  There  are  basically  two  cate- 
gories of  cluster  methods.   Derisive  schemes  successively  partition 
cases  into  smaller  clusters.  They  begin  with  all  data  in  one  group. 
Agglomerative  methods  pool  subsets  of  cases  beginning  with  each  case 
in  a  group  by  itself.   The  cluster  analysis  scheme  used  in  this  work 
is  an  hierarchical  (agglomerative)  method  due  to  Ward  (47) . 

The  cluster  concept  implies  an  optimality  criterion  which  defines  the 
manner  in  which  groups  should  be  combined.   This  objective  function 
Ccin  be  stated  in  terms  of  the  homogeneity  within  a  cluster  and  the 
disparity  between  two  clusters.   These  measures  depend  on  the  distance 
between  groups.   The  euclidean  distance  between  two  vectors  Fi  and  Pj 

"  / 

de  (Fi'  Fj)  =  Z    (Fki-Fkj)' 
k=l 


=  [(Fi  -  Fj)T  (Fi  -  Fj)l^ 


3.5 


Note  that  Fs  are  employed  signifying  distance  between  vectors  of  factor 
scores . 
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This  real  valued  function  is  a  metric  because 
d(Pi,  Fj)  >  0  for  all  Fi  and  Fj 
d(Fi,  Fj)  =  0  if  and  only  if  Fj,  =  Fj 
d{Fi,  Fj)  =  d(Fj,  Fi) 
d(Fi,  Fj)  <  d(Fi,  Fjc)  +  d(Fk,  Fj). 

The  statistical  distance  between  two  clusters  can  similarly  be  defined. 
For  [f]  =  Fi,.,.,Fn^  and  [Y]  =  Y^, . . . ,Yn^,   where  Fi  and  Y^  are  f 
variable  members  of  two  clusters,  the  euclidean  distance  is 

_   "1         -   "2 
where  F  =  Z   ^i/ni  '  ^  =  ^   ^i/no '  ^^^   "l  ^^  "2  ^®  ^^  number  of 
i=l  i=l     ^ 

members  in  groups  [ F]  and  [  Y]  . 

This  intercluster  distance  is  the  increase  in  the  within  group  svun  of 
squares  (WGSS)  when  cluster  [f]  is  combined  with  [y1  to  form  a  new 
group.  Ward's  method  calculates  this  distance  for  all  possible  com- 
binations of  groups  and  combines  the  two  groups  with  the  smallest  dis- 
tance (djjy)  .   Thus,  the  objective  function  minimizes  the  WGSS.  This 
techniquia  favors  agglomeration  into  small,  closely  related  groups  and 
is  particularly  appropriate  in  this  application. 

Euclidean  distance  implies  orthogonal  geometry.   This  occurs  if  each 
variable  is  independent  of  all  the  other  variables.  When  the  variables 
are  highly  correlated,  a  generalized  euclidean  distance  must  be  used. 
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The  Mahalanobis  (21)  metric, 

dnj  (Xi,  Xj)  =  (Xi  -  Xj)Tw-l  CXi  -  Xj)  3.7 

accounts  for  such  interaction  by  including  the  within  scatter  matrix 
([W]),   The  [W]  is  covariance  matrix  computed  for  all  vectors  contained 
in  the  cluster.  Mahalanobis  distance  has  many  advantages  including  the 
fact  that  certain  objective  functions  become  equivalent.   It  is  also 
invariant  under  any  nonsingular  linear  transformation,  thus  eliminating 
scaling  problems. 

The  properties  of  the  Mahalanobis  metric,  coupled  with  a  theorem  in 
Bryan  (4) ,  lead  to  an  elegant  method  of  clustering  utilizing  the 
euclidean  metric.   Bryan  (4)  shows  that  Mahalanobis  distance  is  pro- 
portional to  euclidean  distance  when  the  original  vectors  are  trans- 
formed according  to  [E][X],  provided  [e][cov][e]^  =  [l],         3.8 
where  [E]  is  the  modal  matrix  described  in  Appendix  II.   This  is 
exactly  the  problem  solved  in  the  principal  component  analysis.  We 
now  require,  in  addition,  that  the  eigenvectors  be  rotated  so  that  each 
has  equal  variance.   An  equivalent  means  of  accomplishing  this  result 
in  the  transformed  matrix  [F]  is  to  scale  the  factor  scores  according 
to  [f]X  .   Standard  factor  scores  have  unit  variance  and  the  sum  of 
squares  of  each  eigenvector  is  equal  to  the  associated  eigenvalue. 
Proper  scaling  results  in  factor  scores  which  retain  the  variance  in 
the  original  covaricuice  matrix.  Thus,  the  original  geometry  is  re- 
stored, imparting  a  larger  range  to  factor  scores  associated  with  the 
most  important  principal  components.   In  this  way,  the  relative  impact 
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of  each  coinponent  is  included  in  the  factor  score  matrix. 

When  two  groups  are  combined,  the  degree  of  scatter  within  the  group 

and  the  distance  between  the  new  group  and  other  groups  are  of  interest. 

For  properly  scaled  orthogonal  geometries  as  described  above,  these  two 

measures  are  the  trace  of  the  scatter  matrix  and  the  euclidean  metric. 

The  scatter  matrix  is  confuted  from 

[Sf]=  E    (Fi  -  F)  (Fi  -  F)'^  3.9 

i=l 

where : 

[Sf]=  scatter  matrix  ^ 

n  =  number  of  members  within  the  group 

Fj^  =  a  vector  of  factor  scores  representing 
a  single  case 

F  =  the  mean  factor  score  vector  of  all  members 
of  the  group  combined. 

The  trace  of  this  matrix  is  the  sum  of  the  diagonal  elements  and  repre- 
sents the  total  variance  of  the  group  members  about  the  group  meem 
vector.   This  mean  vector  represents  the  new  group  in  computing  the 
distance  to  each  other  group  and  conceptually  is  the  hyper-centroid  of 
the  multi-dimensional  sphere  which  includes  all  of  the  group  members. 
The  cluster  scheme  used  in  this  work  finds  the  distance  between  all 
groups;  combines  the  two  closest  groups;  computes  the  trace  of  the  new 
scatter  matrix  and  then  computes  the  distance  between  the  new  group 
and  the  other  groups.   This  process  is  continued  until  all  of  the  ori- 
ginal days  have  been  combined  into  a  single  group.  To  construct  the 
dendogram,  the  total  scatter  at  each  cluster  step  is  computed  by  adding 
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the  scatter  for  the  new  group  eind  subtracting  the  scatter  of  the  two 
groups  which  were  combined  to  form  this  new  group.   Initially,  the  total 
scatter  is  zero  when  each  day  stands  alone.  When  two  days  are  combined, 
the  total  scatter  becomes  the  previous  scatter  plus  the  scatter  computed 
on  combining  these  two  days.  This  recognizes  the  fact  that  the  scatter 
between  a  day  and  itself  is  zero.  The  relative  scatter  at  each  step  is 
computed  by  dividing  the  total  scatter  accumulated  at  that  step  by  the 
total  scatter  when  all  of  the  cases  have  been  combined  to  form  one  group. 
All  of  the  original  cases  are  then  arranged  in  cluster  order  so  that 
successive  groups  which  have  the  smallest  distance  between  them  appear 
next  to  one  another.   This  procedure  begins  with  the  largest  groups  and 
systematically  arranges  the  members  of  each  subgroup  according  to  their 
similarity.  The  dendogram  displays  the  membership  of  each  group  from 
individual  cases  to  a  single  composite  group  with  the  relative  scatter 
represented  above  each  group.   The  reader  is  referred  to  Shannon  (39) 
and  Shannon  and  Brezonik  (40)  for  an  environmental  application  of 
cluster  analysis. 

An  example  using  ten  days  from  one  sanpling  station  is  now  presented  to 
illustrate  the  use  of  principal  conjjonents  and  cluster  analysis  to  char- 
acterize and  group  days  which  are  most  similar.   Table  3-1  displays  the 
reported  ozone  concentration  for  the  ten  sample  days,  selected  specifi- 
cally to  illustrate  the  cluster  example.   Start  hour   zero  represents  the 
one  hour  average  from  midnight  until  1:00  AM  and  this  convention  applies 
to  all  24  hours.  The  example  was  constructed  to  show  a  clear  separation 
between  three  basic  types  of  days. 
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First  a  covariance  matxix  about  the  mean  was  computed.  Symbolically, 

_    10 

%  ~  ^   ^dh  ^°^  ^  ^^°"'  0  to  23  3.10 

d=l 

10        _        _ 

^\,h  =  ^    ^^dh  -  V  (Xdh  -  Xj,)  3.11 

d=l 

10 

10        _ 
V  =  ^    {X^-Xi,)2  3.12 

d=l 

9 

where 

Xjj  =  mean  for  each  hour  averaged  over  ten  days 

X(3h  =  observed  concentration  for  each  hour  on  each  day 

^  Yi   h  ~  covariance  between  any  two  hours  averaged  over 
ten  days 

2 
a.      =   variance  for  hour  h. 
h 

2 
Note  that  a  q^q  is  the  variance  for  starting  hour  zero  about  the  mean. 

The  square  root  of  this  value  is  the  standard  deviation  of  hour  zero. 

In  Table  3,1,  we  find  the  mean  for  hour  zero  was  32  Mg/m^  and  the 

standard  deviation  was  30  ug/m-^.      Thus,  the  variance  contributed  by 

hoior  zero  in  this  example  is  (30)^  =  900  (yg/m-^)^. 

The  total  variance  of  the  covariance  matrix  is  the  sum  of  the  variance 

contributed  by  each  of  the  24  hours.  Symbolically, 

24 
a^^  =  2   a  2  3^-L3 

h=l  ^ 
where 

o„     =  total  variance  of  the  matrix   (the  trace) . 
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Summing  the  squared  standard  deviations  in  Table  3.1  yields 
36351  (lig/m^)  2  which  is  approximately  equal  to  the  full  precision  vari- 
ance of  36569 (pg/m^) 2  computed  for  the  ten  example  days.   Off-diagonal 
elements  such  as  a3^^2  ^^  computed  using  Equation  3.11  which  resulted 
in  659(vig/m3)2  in  this  example. 

A  real  symmetric  covariance  matrix,  having  24  rows  and  24  columns,  also 
has  24  roots  which  satisfy  its  characteristic  equation  (see  Appendix  II) 
These  roots,  called  eigenvalues,  satisfy  Equation  3.2.  For  the  example 
in  Table  3.1,  the  three  largest  roots  were  found  to  be  27390,  7141  and 
810  as  in  Dixon  (9, p.  255).   Since  the  total  variance  was  36569,  these 
roots  represent  74.9%,  19.5%  and  2.2%  of  the  variance  respectively.   Col- 
lectively, they  account  for  96.6%  of  the  differences  between  each  of  the 
ten  days  and  the  average  day.   Applying  equation  j.3  and  solving  for 
E,  results  in  the  first  principal  component  shown  in  Column  No.  1  of 
Figure  3.1.  This  vector  of  24  values  is  plotted  opposite  the  number 
75  and  above  the  mean  vector  in  Figxare  3.1.   The  variance  of  these  24 
values  about  the  origin  is  27390  which  equals  the  magnitude  of  its 
associated  latent  root.   Similarly,  eigenvectors  for  the  second  and 
third  roots  can  be  computed  and  are  shown  in  Figure  3,1.   Each  of  the 
ten  days  Ccin  now  be  approximated  by  adding  or  subtracting  various  amounts 
of  each  of  these  three  components  to  the  mean  vector.  These  various 
amoiants  are  the  factor  scores  computed  from  Equation  3.4.   In  Table  3.2, 
these  three  values  for  each  day  in  the  example  appear  in  the  three 
columns  under  unit  variance  scores.   Choosing  April  28,  starting  hoior 
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zero  is  reconstructed  to  be 

32+(1.44) (15.1)+(1.14) (23.2)+(-.21) (-8.8)  =  82  ug/m^. 
Hour  12  is  represented  by 

110+(1.44) (50.9)+(1.14)(-6.7)+(-.21) (7.3)  =  174  yg/m^. 
Similarly,  all  of  the  original  data  can  be  reconstructed  to  satisfy  the 

model  given  in  Equation  3.1.  The  principal  components  also  indicate  the 
correlation  between  hours  found  in  these  ten  days.   Looking  at  Figure  3.1, 
we  see  that  the  contribution  from  hours  nine  through  thirteen  is  approx- 
imately constant  for  both  the  first  and  second  components  with  the  mean 
accounting  for  differences  during  this  interval.  Similarly,  hours  zero 
through  three  contribute  approximately  the  same  amount  on  any  given  day 
and  the  period  from  hour  20  through  midnight  is  important  in  defining 
the  second  component. 

Factor  scores  are  independent  of  one  another  since  they  represent  the 
proportions  of  independent  components  which  make  up  each  day.  There- 
fore, they  can  be  represented  geometrically  along  perpendicular  axis. 
The  unit  variance  scores  weight  each  factor  equally.   In  the  ten  day 
example  factor  one  actually  represents  75%  of  the  difference  while  the 
second  factor  represents  only  20%.   In  order  to  achieve  the  correct 
grouping  when  these  scores  are  clustered,  the  total  variance  in  each 
component  needs  to  be  transfered  to  the  factor  scores.  As  discussed 
above,  the  easiest  way  to  do  this  is  to  multiply  each  unit  variance 
score  by  the  square  root  of  the  applicable  eigenvalue  (165.5,  84.5 
and  28.5,  respectively)  .   This  scaling  produces  zero  mecin  factor  scores 
which  have  variance  equal  to  the  original  contribution  of  each 
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component.  Thus  the  sian  of  squares  of  factor  scores  confuted  over  all 
ten  days  equals  the  proportionate  variance  and  the  associated  eigen- 
vectors each  have  unit  variance.  These  factor  scores  are  shown  in 
Table  3.2  under  Mahalanobis  scaled  scores.  They  are  called  Mahaleuiobis 
scores  because  they  yield  a  measure  of  Mahalanobis  distance  using  the 
Euclidean  metric - 

To  illustrate  the  clustering  method  both  unit  variance  and  Meihalanobis 
scaled  scores  have  been  grouped  and  are  displayed  in  Figure  3.2.  The 
numerical  example  now  presented  applies  to  the  Mahalanobis  clustering 
since  this  provides  a  distinct  grouping.   The  computation  is  the  same 
for  unit  variance  scores  and  can  be  directly  extended  to  as  meuiy  dimen- 
sions desired.  The  first  step  is  to  compute  the  euclidean  distance 
between  all  groups.  For  ten  cases,  this  is  a  collection  of  45  numbers. 
The  smallest  of  these  represents  the  distance  between  December  7  and 
November  16  and  was  computed  to  be 

(-170- (-176)) 2  +  (15-17)2  +  (38.6-17.3)2  =  22.22. 
The  largest  number  was  267.75  representing  the  distance  between  April  28 
and  May  2.   Referring  to  Table  3.2,  this  value  was  computed  from 

(239-31)2  +  {97_(-65))2  +  (-6-40.7)2  =  257.75. 

The  first  step  in  clustering  was  therefore  to  combine  December  7  and 

November  16-  Since  the  mean  factor  scores  for  these  two  days  are  Fj^=173, 

F2=16  and  F3=28,  the  trace  of  the  scatter  matrix  at  this  step  is  computed 

from  (-170- (-173) ) 2+ (-176- (-173) ) 2+(i5-16) 2+(l7-16) 2+ 
(38.6-28) 2+(17. 3-28) 2  =  245. 
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Date 


TABLE  3.2 


Factor  Scores  for  Ten  Day  Example 


Unit  Variance  Scores 


Mahalanobis  Scaled  Scores 


Fl 

F2 

F3 

Fl 

F2 

F3 

4/28 

1.44 

1.14 

-0.21 

239 

97 

-6.0 

4/20 

1.43 

1.27 

0.48 

236 

107 

13.6 

12/7 

-1.03 

0.18 

1.36 

-170 

15 

38.6 

11/16 

-1.06 

0.20 

0.61 

-176 

17 

17.3 

11/30 

-1.13 

0.73 

-1.32 

-187 

61 

-37.7 

12/14 

-0.97 

0.78 

-0.60 

-160 

66 

-17.1 

5/28 

0.33 

-1.11 

0.12 

55 

-94 

3.4 

6/12 

0.42 

-1.48 

-0.37 

70 

-125 

-10.7 

8/19 

0.38 

-0.93 

-1.48 

63 

-79 

-42.3 

5/2 

0.19 

-0.77 

1.43 

31 

-65 

40.7 

variance 

1.0 

1.0 

1.0 

27390 

7141 

810 

FIGURE   3.2 


Dendograms  for  Ten  Day  Example 
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Since  this  is  the  first  step,  the  total  cumulative  scatter  is  zero  plus 
245.   Only  the  distance  from  the  new  group  to  other  groups  needs  to  be 
recomputed.  On  the  second  iteration,  the  smallest  overall  distance  was 
found  to  be  22.24  between  April  20  and  April  28.  When  these  two  days 
were  combined,  the  total  scatter  of  the  scatter  matrix  was  determined 
to  be  247-   Thus  the  total  cumulative  scatter  after  completing  the  second 
step  was  245  +  247  =  492.   Similarly,  pairs  of  days  were  combined  for 
steps  three  and  four. 

Step  six  illustrates  the  procedure  when  two  groups,  both  containing  more 
than  one  day,  are  combined.   The  total  cxomulative  scatter  after  comple- 
tion of  step  five  was  3389.   When  December  14  amd  November  30  were  com- 
bined, the  total  variance  of  the  scatter  matrix  for  these  two  days  were 
585.  The  trace  of  the  scatter  matrix  for  the  fovu:  days  combined  was 
calculated  as  the  sum  of  12  squared  values  and  was  equal  to  12736.   Thus, 
the  total  cumulative  scatter  on  completion  of  step  six  was  15297.   We 
see  this  vmder  step  six  in  Table  3.3. 

On  completion  of  clustering  the  cumulative  scatter  was  318069  as  shown 
in  Table  3.3.  This  is  derived  from  the  total  variance  of  the  Mahalanobis 
factor  score  matrix  shown  in  Table  3.2.   Notice  that  the  mean  of  each 
set  of  ten  factor  scores  in  Table  3.2  is  zero.   Thus,  the  scatter  is 
equal  to  the  sum  of  squares.   Ten  zero  mean  values  having  a  variance 
of  27390  (Mahalanobis  Fl)  have  a  sum  of  squares  equal  to  246510.   This 
is  apparent  from  Equation  3.12,  since  nine  times  27390  equals  246510. 
Similcurly,  the  sum  of  squares  for  F2  and  F3  are  64269  and  7290.   The 
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total  sum  of  squares  for  the  Mahalanobis  scores  in  Table  3.2  is  therefore 

246510  +  64269  +  7290  =  318069. 
■Ehis  represents  96.6%  of  the  total  variance  shown  on  Page  28.   Thus, 
Mahalanobis  scores  preserve  the  original  variance.   This  is  not  true  of 
the  unit  variance  scores  which  result  in  incorrect  scaling  and  clustering 
errors . 

Table  3.3 


Ten  Day  Cluster  Example 
Scatter  Summary 


Step 
No. 

%  Relative 
Scatter 

0.08 

Cumulative 
Scatter 

243 

: 

Previous 
Scatter 

0 

+ 
+ 

New 
Group 

243  - 

Member  - 
A 

Member 
B 

1 

0  - 

0 

2 

0.16 

495 

= 

243 

+ 

252  - 

0  - 

0 

3 

0.34 

1080 

= 

495 

+ 

585  - 

0  - 

0 

4 

0.56 

1768 

= 

1080 

+ 

688  - 

0  - 

0 

5 

1.07 

3389 

= 

1768 

+ 

2309  - 

0  - 

688 

6 

4.81 

15297 

= 

3389 

+ 

12736  - 

585  - 

243 

7 

6.15 

19568 

= 

15297 

+ 

6580  - 

0  - 

2309 

8 

47.40 

150778 

= 

19568 

+150526  - 

6580  - 

12736 

9 

100.00 

318069 

= 

150778 

+318069  - 

150526  - 

252 

Note;   Member  A  and  Member  B  are  the  two  groups  being  combined  to 
form  the  New  Group. 
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When  all  clustering  steps  have  been  completed,  the  relative  scatter  is  cal- 
culated by  dividing  each  cumiilative  scatter  by  the  total  scatter  in  the 
factor  matrix.  This  is  also  equivalent  to  total  variance.  A  dendogram 
can  be  constructed  from  the  relative  scatter  by  arranging  the  days  so 
that  nearest  neighbors  are  adjacent.   The  nine  steps  and  associated 
relative  scatter  are  shown  in  Table  3.3.   Clustering  for  the  unit  vari- 
ance scores  is  done  analogously,  with  the  total  siom  of  squares  on  the 
ninth  step  equal  to  27.  The  dendogram  is  shown  in  Figure  3.2.  Although 
the  groupings  for  both  unit  variance  and  Mahalanobis  scores  are  the  same 
in  this  example,  significant  differences  in  measures  of  closeness  are 
clear.   These  differences  can  cause  grouping  errors,  especially  when 
differences  between  cases  are  not  a  pronounced  as  those  given  in  this 
example.   In  practice,  tabulating  the  scatter  values  is  not  necessai^f 
since  the  dendogram  graphically  shows  how  to  group  each  set  of  data. 

Figure  3.3  is  a  visual  aid  to  understanding  cluster  analysis.  The  fac- 
tor scores  representing  each  of  the  ten  days  have  been  plotted  in  three 
dimensions.   Clustering  can  be  accomplished  in  multiple  dimensions,  but 
more  than  three  dimensions  becomes  difficult  to  visualize.   Clearly  the 
April  dates  are  set  apart  from  the  other  eight  days.  The  November  and 
December  days  were  grouped  and  set  apart  from  the  May,  June  and  August 
days.  The  relative  scatter  and  closeness  between  groups  are  clearly 
illustrated  in  Figures  3.2  and  3.3  and  in  Table  3.3. 

Lognormal  plots  (see  Appendix  I,  p.  105   )  of  these  ten  days  are  in  Figure 
3.4.  The  same  grouping  illustrated  above  can  be  postulated.   However, 
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FIGURE  3.3 
Three  Dimensional  Display 
Factor  Scores  for  Ten  Day  Example 
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grouping  on  geometric  mean  or  standard  geometric  deviation  does  not 
necessarily  produce  the  same  clusters  derived  from  orthogonal  components. 

I  have  briefly  described  necessary  mathematical  concepts  and  presented 
an  example  to  illustrate  how  information  can  be  condensed  and  utilized 
to  identify  similarities  and  differences  among  days  in  Chapter  III.  This 
example  is  extended  in  Chapter  IV  and  applied  to  ozone  data  reported  in 
south  Florida.  Before  proceeding  to  this  analysis,  a  discussion  of 
empirical  modeling  is  next  presented  with  a  flow  chart  illustrating 
steps  in  the  procedure. 

A  posteriori  models  require  extensive  measurement  of  ambient  conditions 
over  time.  The  sampling  network  provides  these  data.   Ideally  classical 
modeling  techniques  would  be  used  to  determine  the  optimal  number,  loca- 
tion and  sampling  frequency  for  the  network.   Performance  is  ordinarily 
evaluated  after  considerable  data  have  been  collected-  These  data 
should  be  used  to  confirm  or  investigate  preliminary  classical  model 
results  and  to  suggest  desirable  changes  in  the  sampling  procedures. 
The  ultimate  application  for  stochastic  models  is  for  real  time  analysis 
utilizing  microprocessors  to  predict  future  air  quality  based  on  past 
and  present  conditions.   These  techniques  could  also  be  used  to  initiate 
control  strategies  and  signal  a  need  for  more  intensive  sampling  during 
critical  periods.  Stochastic  models  need  to  be  developed  before  real 
time  control  strategies  can  be  formulated. 
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To  build  empirical  simulation  models,  one  must  observe  enough  repeated 

experiments  to  develop  a  reliable  relationship  between  inputs  and  out- 
puts. When  conditions  change  often  and  in  random  fashion,  as  in  the 
atmosphere,  these  experiments  need  to  be  grouped  for  similarity  in  order 
to  construct  meaningful  models.   If  all  conditions  were  used  indiscrim- 
inantly,  an  "average  model"  would  be  derived  which  actually  would  not 
represent  any  of  the  observed  conditions.  Figure  3.5  illustrates  the  steps 
necessary  to  analyze  data  preparing  for  stochastic  modeling.  The  con- 
ventions used  are  tiie  same  as  those  for  critical  path  analysis.   Each 
arrow  represents  a  task  or  operation  to  be  completed.  The  numbered 
points  mark  the  initiation  and  completion  of  each  task.   Successive 
operations  cannot  begin  until  all  of  the  preceding  tasks  have  been  com- 
pleted.  The  first  step,  proceeding  from  point  zero  to  point  1,  is  col- 
lecting the  necessary  data.   It  includes  all  of  the  laboratory,  field 
and  administrative  (auditing,  etc.)  quality  assurance  procedures  neces- 
sary to  insure  valid  data.   Selecting  data  of  interest  concerns  the 
frequency  of  elevated  pollutant  levels,  duration  of  conditions,  accuracy 
of  data,  availability  of  support  data,  sampling  network  density  and 
other  considerations.  For  multivariate  problems,  covariance  matrices 
can  be  computed  after  screening  the  data  to  exclude  anomolies.  At  the 
same  time ,  preliminary  summaries,  such  as  tabulating  all  days  which  ex- 
ceed standards,  can  be  performed.   The  covariance  matrices  are  used  to 
estimate  missing  observations  and  to  compute  principal  con^onents . 
After  these  tasks  are  coit5>leted,  factor  scores  for  each  day  can  be 
found  to  represent  the  original  data  in  condensed  form.   The  principal 
component  results  are  us^d  to  further  cheiracterize  the  data  as  soon  as 
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they  become  available.   Thus,  the  dashed  arrow  between  points  fovir  and 
two  indicates  that  these  results  will  be  used  to  proceed  from  point  two 
to  point  six  (preliminary  characterization)  but  that  this  task  can  begin 
before  principal  con^ionents  and  missing  observations  have  been  confuted. 
Using  the  properly  scaled  factor  scores,  groupings  at  each  spatial  lo- 
cation can  be  computed  which  characterize  the  waveforms  as  a  function 
of  time  independent  of  space.  Using  the  preliminary  characteristic 
results,  these  clusters  can  then  be  summarized.  The  number  of  dif^ 
ferent  types  of  days  and  the  frequency  of  occurrence  as  well  as  the 

relative  similarity  can  be  determined  from  the  dendograms.  Once  these 
groups  have  been  defined,  their  characteristics  can  be  further  analyzed 
as  shown  in  Figure  3.5  at  point  seven.  Appropriate  distributions  can  be 
fit  for  each  group  (see  Appendix  I) .  These  indicate  the  probability  of 
observing  high  concentrations  in  each  group.   However,  they  do  not  pro- 
vide the  functional  relationships  in  time  that  stochastic  models  repre- 
sent.  Principal  components  for  each  subgroup  Ccin  be  computed  and  plot- 
ted along  with  the  group-average  waveform.   This  visual  display  of  group 
characteristics  shows  the  kinds  of  models  needed  and  the  periods  during 
each  day  which  are  of  particular  interest.   The  total  variance  for  each 
group  indicates  the  magnitude  of  differences  between  members  of  each 
group  and  is  an  indicator  of  the  ease  with  which  a  stochastic  model 
could  be  constructed.   Average  factor  scores  for  each  group  at  each  site 
can  be  computed  and  clustered  to  locate  similar  conditions  in  time  which 
have  occurred  at  different  spatial  locations. 
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Clustering  the  average  factor  scores  shows  spatial  as  well  as  temporal 
similarity.  Utilizing  the  svunmarized  characteristics  available  at  this 
point  results  in  major  group  types.   In  this  study,  three  types  were 
identified  and  named  non-episode,  episode  and  super  episode  conditions. 
The  subgroups  for  each  major  type  were  analyzed  and  displayed  in  cluster 
order  to  illustrate  differences  between  day  types.  The  kind  of  day 
observed  at  all  sampling  sites  was  tabulated  for  every  day  which  had  a 
super  episode  condition.   This  is  indicated  as  date  comparison  in  Figure 
3.5.   A  choice  of  the  most  interesting,  potentially  modelable  conditions 
can  be  derived  from  this  analysis.   The  next  step  in  model  development 
is  identifying  influence  parameters.   By  contrasting  conditions  on  super 
episode  days  against  non-episode  or  episode  days,  meteorological  or  air 
quality  parameters  which  seem  to  affect  the  system  can  be  found.   Having 
selected  interesting  influence  data,  formal  procedures  such  as  cluster 
analysis  or  discriminant  analysis  can  be  used  to  determine  the  relative 
importance  of  each  factor.   The  conditions  which  are  to  be  modeled  and 
input  data  can  then  be  selected.   If  sufficient  influence  data  are 
available  and  these  conditions  have  been  repeated  often  enough,  models 
can  be  formulated.   These  models  can  be  constructed  from  the  raw  data. 
Alternatively,  weighted  average  data  reconstructed  from  the  factor 
scores  could  be  used  to  smooth  random  variations  before  models  are 
constructed.  The  inputs  and  system  response  which  characterize  various 
kinds  of  days  can  then  be  used  to  evaluate  classical  models  and  air 
pollution  potential. 


CHAPTER  IV 
AN  APPLICATION  TO  URBAN  OZONE  DATA 

Photochemical  oxidant  levels  are  a  peirvasive  problem  throughout  the 
continental  United  States.  Literally  hundreds  of  journal  articles  and 
millions  of  dollars  have  been  directed  toward  this  problem  during  the 
past  few  years.   Initially,  hydrocarbon  emission  control  was  suggested 
as  an  answer  to  the  urban  oxidant  problem.  With  these  emission  con- 
trols, ozone  concentrations  remain  very  high.  Even  rural  levels  fre- 
quently exceed  established  stamdards  dturing  spring  and  summer  months. 
Reynolds  et  al.  (37)  and  Roth  et  al.  (38)  have  developed  a  comprehensive 
emission  inventory  and  a  sophisticated  oxidant  model  based  on  numerical 
integration.   Under  certain  conditions  in  the  Los  Angeles  basin,  the 
model  performs  well.   This  model  is  presently  being  modified  for  appli- 
cation in  Tampa,  Florida,  which  historically  has  had  oxidant  episodes 
lasting  several  days  with  levels  reaching  twice  the  standard.   Further 
study  is  under  way  nationally  with  a  particularly  comprehensive  program 
in  Houston,  Texas,  sponsored  by  the  Houston  Chamber  of  Conraierce  under 
Feldcamp  (13) . 

Oxidants  cire  a  secondary  pollutant,  since  they  axe   not  emitted  directly 
from  sources.   Precursor  compounds  such  as  reactive  hydrocarbons  and 
nitrogen  oxides  interact  in  complicated  chemical  reactions  which  are 
not  clearly  understood.  Intensive  solar  insolation  appears  to  be  a 
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necessary  ingredient,  or  at  least  a  contributing  factor.   A  recent 
study  reported  by  Spicer  et  al.  (43)  concluded  regional  scale  transport 
and  high  pressure  systems  must  be  considered.  Spicer  et  al.  (43)  also 
concluded  that  oxidant  sources  were  ground  based,  since  the  ozone  con- 
centration decreased  with  altitude. 

The  Hillsborough  County  Environmental  Protection  Commission  has  rou- 
tinely collected  ozone  data  using  chemiluminescent  monitors  since  1973. 
A  90%  confidence  inteirval  for  the  three-stage  calibration  procedure 
was  typically  ilO  ppb  as  shown  in  Appendix  I.   During  1974,  there  were 
82  days  and  a  total  of  589  recorded  station-hours  exceeding  159  ug/m^ 
in  the  data  set.   These  data  are  summarized  in  Table  4.1.   A  brief 
review  of  plotted  diurnal  waveforms  revealed  a  bizarre  collection  of 
cases  which  did  not  follow  the  typical  curves  given  in  USDHEW  AP-63 
(46) .   Since  oxidants  are  a  national  concern  and  since  an   extensive 
set  of  reliable  data  was  readily  available,  the  Tanpa  area  ozone  data 
were  selected  for  analysis. 

The  HCEPC  maintained  three  ozone  monitors  during  the  study  period. 
Station  63,  located  on  Davis  Islands  at  the  Coast  Guard  basin,  is  an 
urban  site  which  historically  has  reported  some  of  Tampa's  poorest  air 
quality.   A  second  urban  ozone  sampler,  designated  Station  94,  is  lo- 
cated at  the  intersection  of  Dale  Mabry  Highway  amd  Cypress  Avenue. 
It  is  less  than  ?00  yards  south  of  the  heavily  traveled  Interstate 
Highway  4.  The  station  at  the  fire  department  on  Gunn  Highway  was 
numbered  109  and  is  approximately  five  miles  northwest  of  the  urban 
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IKStS  t.l 

Contraventions  of  Lha  Oxidant  Standard 
Obsezvad  in  Hillsborough  eod  Plnallau  Countle'i,  Florida  in  1974 

SoivUn?  Station  Sa^tllnq  Station 

DME                    63  94                    109                    STP                      OME                    63                    94                    109                    STP 

2-U                      2(B)  3  (SB)                                                                 6-6                        7  (SB)              1(E) 

3-2  1(C)                                            6-12                      6  (SB) 

3-S  3(e)                                            6-13                      9(SE)              3  (SB)            10  0                     6  (SB) 

3-6  6(E)                                            6-14                      4(E)                                        2(E) 

3-9  2!E)                                         6-15                                                              .                          1{e) 

3-10                      S(SB)  3(E>                                         1(E)                  C-18                      5(SE)              3(B)                2  0                     J  (SE) 

3-11                      3  0  KB)                                                                  6-19                      2  (SB)                                      3  0                     4  (SB) 

3-l«  OiE)                                            6-21                      KB)                                                                  2(B) 

3-18                      1(E)  6-22                      2(B)                KB) 

3-19                      3  (SB)  6(SE)                                                                6-24                                                                                               KB) 

3-26                      2(B)  KB)                                            6-28                      2(P)                                                                  1(b) 

>30                      KB)  6-29                                                                                               a  (SB) 

3-31                      3(B)  2  0                                           2  (SB)                7-8                        KNS) 

4-7  KB)                                                              7-14                                          KB) 

4-10                   ■  2(B)  KSE)                7-15                      7  O                 6(SE)              KB) 

4-17  2(SE)                7-16                      7(SEJ              6(SE) 

4-18                      6(SE)  2  (SB)              7  (SB)                 a(SE)                7-17                      7(SBI              4  (SB)              S  0 

4-19                    IKSE)  5  (SB)            IKSE)                 7(S£)                7-18                      1(E)                                        5  0 

4-20                      2(B)  2  (SB)                                         7-32                      7(B) 

4-21                      KB)  7-33                    10  (SB)              5  (SB)              4(SE) 

4-24                      KB)  3(SB)                                          7-24                      6(SB)              4(B)                 5(SB) 

4-25                      2  (SB)  S(SE)                5(SE)                7-29                      9(SB)                                      1(S) 

4-26                      2(E)  3(SB>                 2(SB)                7-30                      9(SB)              7(Se>              2(SE)                 4(SE) 

4-27  4  (SB)                                          8-11                      3(E) 

4-28                      2(B)  3(SB)                                          8-14                      4(B)                3(SB)                                         3<SE) 

4-29                      a(SE)  3  (SB)              7(SB)                 KSE)                B-IS                                             KB)                 KB) 

4-30                      ao  4(SEJ              5  (SB)                                          8-19                      3(E) 

S-1                        5  (SB)  KB)                                         2(E)                  a-21                      KB)                                        2(E) 

5-2                        4(B)  KE)                                                                  8-22                                                                                               1(B) 

5-4                        2(E)  a-26                      KB)                                        3(E)                   2(E) 

S-7                        2(B)  2  (SB)                                          a-27                                                                                               2(E) 

5-8                        SOB)  6  (SB)              6(SE)                 4(SE)                8-28                                                                                               KE) 

5-9                        S(E)  3(SE)              S(SE)                 4(SE)                9-10                                                                                               3(SB) 

5-10                      7  (SB)  5(SE)              5  (SB)                                          9-17                      1(B)                KE)                                           KNE) 

5-13                      7(SE)  KE)                3(5E)                 5(SE)                9-18                                             2  (SB)                                         2(B) 

5-15  KB)                                            9-25 

5-28                      3(E)  10-8 

5-29                      4(SE)  10-10 

5-31                      2(SE)  10-15 

6-4                        2(B)  10-19 

6-5                        7<SE)  2  (SB)                                       2(SE)                12-28                    KB) 

tiotai  1)     Intaqsrs  are  total  hours  chat  day  axcseded  159  ug/n^. 

2)  SB  -  Super  Episode,  E  -  Ltiisode,  :;b  -  Ikin- Episode,  0  -  Onltted 

3)  Blank  Indicatas  no  hours  exceeding  159  Vg/a^. 

4)  Sailing  stations  described  on  Pag*  44 


KB) 

KE) 
2  (SB) 

KB) 

KSE) 

3  (BE) 
3(SE) 
3(E) 

6(SE) 

KSE) 

6(SE) 
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center.  It  is  a  semi -urban  site  on  the  periphery  between  urbcin  Tampa 
and  the  agricultioral  areas  north  of  town.   Consequently,  at  this  site 
both  urbcin  and  rural  effects  can  be  observed  depending  on  meteorology. 
The  remaining  sampling  site,  designated  STP,  was  located  in  downtown 
Saint  Petersburg. 

The  first  step  was  to  obtain  a  copy  of  the  State  of  Florida  Air  Quality 
Data  Handling  System  (AQDHS)  data  tape.   All  ozone  data  from  Hills- 
borough and  Pinellas  counties,  through  September,  1975,  were  stripped 
from  this  tape  to  direct  access  files.   These  data  were  then  analyzed 
to  locate  erroneous  values  and  days  with  more  than  50%  missing  values. 
A  number  of  errors  were  found,  checked  with  the  local  agency  and  cor- 
rected.  Days  with  too  little  data  were  removed.   The  Pearson  Correla- 
tion program  described  in  Nie  et  al.  (28)  was  then  used  to  compute  a 
correlation  matrix  for  all  remaining  data  at  each  station.   This  routine 
was  used  because  it  accommodates  missing  observations  and  because  it 
employs  the  double  precision  arithmetic  required  for  large  numbers  of 
cases.  A  covariance  matrix  was  formed  from  the  correlation  matrix  using 
the  standard  deviation  for  each  variable  (hour  of  day)  from  the  Pearson 
Correlation  program.   Principal  components  for  each  covariance  matrix 
were  computed  using  the  BMD08M  routine  from  Dixon  ( 9 ) .   This  routine 
was  selected  because  a  very  flexible  input  specification  is  provided. 
Figure  4.1  is  a  plot  of  the  mean  and  first  four  principal  components 
of  the  ozone  data  at  each  station  based  on  all  data  available.   Prin- 
cipal components  are  discussed  in  Chapter  III  and  in  Appendix  II. 
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If  all  of  the  principal  coit^onents  were  retained,  100%  of  the  total 
variance  would  remain  and  only  the  geometric  representation  of  the  data 
would  be  changed.   However,  when  the  original  variables  are  intercor- 
related,  most  of  the  information  can  be  represented  with  relatively 
few  factors.  The  criterion  used  in  this  study  was  retention  of  90% 
of  the  total  variance.   In  this  application,  greater  than  90%  of  the 
total  variance  at  each  station  was  accounted  for  by  six  components. 
Thus,  each  day  in  the  original  data  set  was  reduced  from  24  hourly 
observations  to  a  set  of  six  factor  scores  while  retaining  90%  of  the 
information  contained  in  24  values.   This  achieves  computational  effi- 
ciency and  orthogonal  structure  as  described  and  illustrated  in  the 
exan^Jle  of  Chapter  III. 

At  this  point,  the  results  of  the  principal  component  analysis  can 
best  be  presented  by  example.   Scanning  Table  4.1,  April  19  looks 
interesting  with  all  sites  exceeding  159  yg/m^  for  at  least  5  hours. 
The  mean  and  principal  components  for  Site  109,  which  are  presented  in 
Figure  4.1,  are  presented  in  Table  4.2.   The  total  variance  for  the  551 
days  in  this  site  was  25561  (yg/m-^)^.   These  ntunbers  are  shown  at  the 
top  of  Figure  4.1  above  Site  109.  The  first  six  eigenvalues  for  this 
site,  arranged  from  largest  to  smallest,  are 

13614    4882    2616    1240    880    and    525. 
Dividing  each  of  the  eigenvalues  by  the  total  variance,  we  see  that 
component  one  represents  53%  of  the  variance.   Similarly,  components 
two,  three  and  four  represent  19,  ten  and  five  percent  of  the  variance 
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respectively.  These  numbers  also  appear  in  Figure  4.1  to  the  right  of 
the  component  which  they  represent.  The  mean  for  each  hour  in  the  day 
is  shown  at  the  bottom  of  this  figure.   Standard  factor  scores  were 
computed  using  Equation  3.4.   For  April  19  at  Site  109,  these  were 

3,862    1.101    -1.476    0.883    -0.101    and    -1.418. 
These  factor  scores  have  a  zero  mean  and  unity  variance  when  computed 
over  all  days  represented.   They  represent  the  contribution  of  each 
corresponding  orthogonal  component  which  must  be  added  to  the  Site  109 
mean  vector  to  reconstruct  the  hourly  values  for  April  19.   For  example, 
the  first  hour  in  this  day  can  be  reconstructed  using  the  top  line  of 
Table  4.2  and  the  six  values  above.  Escplicitly, 

26.4  +  (3.862) (13,4)  +  (1.101)  (17.2)  +  (-1.476)  (-1.8)  + 
(0.883)  (-10.4)  +  (-0.101)  (-2.2)  +  (-1.413)  (-8.6)  =  103  ug/m"^ 

This  value  (103  Mq/m^)    appears  in  Table  4.3  as  the  first  entry  under 
reconstructed  values.   Similarly,  the  second  hour  of  April  19  at  Site 
109  was  reconstructed  as  100  ug/m^  using  the  same  factor  scores  above 
and  values  from  the  second  line  in  Table  4.2.  Table  4.3  summarizes 
the  reconstruction  of  each  of  the  24  hours  in  April  19  at  Site  109. 
The  average  difference  between  the  observed  concentration  and  the 
reconstructed  concentration  was  0.4  yg/m-^.  Since  April  19  was  an 
unusual  day  with  respect  to  the  mean  day,  reconstruction  would  be 
much  better  for  a  day  selected  at  random.  None  the  less,  reconstruc- 
tion within  16%  of  the  observed  value  was  achieved  for  every  hour. 
In  summary,  most  of  the  ozone  information  for  each  station-day  can 
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be  represented  by  only  six  values  having  orthogonal  geometry,  rather 
than  the  24  intercorrelated  hourly  values. 

All  missing  observations  were  reconstructed  using  a  multiple  regression 
routine.  Each  missing  hour  was  computed  based  on  the  distribution  over 
all  cases  for  that  specific  hovir.   The  weighting  function  used  was  the 
ozone  concentration  for  all  hours  which  were  measured.   This  procedure 
was  necessary,  since  factor  scores  cannot  be  computed  unless  all  of 
the  original  variables  (hours)  are  available.   The  factor  scores  were 
computed  and  scaled  as  described  in  Chapter  III  to  achieve  orthogonal 
Mahalanobis  structure.   This  is  accomplished  by  multiplying  each  of  the 
standard  factor  scores  by  the  square  root  of  the  corresponding  eigen- 
value. As  an  illustration,  for  April  19  at  Site  109,  the  properly 
scaled  factor  scores  are 

450.7    76.9    -75.5    31.1    -3.0    and    -32.5. 
These  values  were  computed  from  3.862^3614,  1.101^^882,  etc.   Computed 
over  all  cases,  these  factor  scores  have  zero  mean,  and  variance  equal 
to  the  eigenvalue  for  each  specific  factor.   Thus,  the  total  variance 
of  the  factor  score  matrix  would  remain  25561  (ug/m-^)^  if  all  24  factors 
were  retained.   Further,  the  variance  of  each  factor  relative  to  the 
other  factors  has  been  restored.  The  standard  factor  scores  weight  each 
factor  equally,  when  actually  factor  one  accounts  for  over  half  of  the 
total  difference  of  a  specific  day  compared  with  the  mean  day.  Using 
this  same  procedure,  factor  scores  for  each  of  the  four  sampling  sites 
were  computed  based  on  the  principal  components  for  that  specific  site 
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alone.  In  this  way,  the  simplicity  and  economy  of  the  euclidean  metric 
can  be  used  to  achieve  Mahalanobis  clustering.  The  Mahalanobis  metric 
would  also  provide  this  same  result,  but  at  considerably  greater  ex- 
pense. 

Ward's  method,  utilizing  the  euclidean  metric,  was  applied  to  all  avail- 
able days  for  1974.  A  six-space  clustering  was  constructed  using  the 
factor  scores  as  variables.   The  dendograms  in  Figures  4.2  through  4.5 
detail  the  macros true ture  of  the  separation  between  cases.   The  nianber 
of  groups  was  determined  subjectively  by  computing  average  factor  scores 
for  the  first  four  components.  The  within-group  scatter  distance  was 
used  to  make  an  initial  separation  into  approximately  35  clusters.   The 
difference  between  the  average  factor  scores  for  separate  as  well  as 
combined  groupings  was  then  used  to  achieve  a  manageable  number  of  sub- 
groups.  Each  of  the  three  numbers  in  each  block  of  these  dendograms 
is  the  average  factor  score  for  all  members  of  that  group.   The  tick 
marks  along  the  horizontal  axis  represent  cases  which  recorded  at  least 
one  hour  exceeding  159  pg/ra^.  April  19  at  Station  109  was  a  member  of 
Group  109-13.   This  notation  is  used  throughout  the  remainder  of  the 
text  to  signify  the  station  and  subgroup  corresponding  to  Figures  4.2 
through  4.5.  Group  109-13  means  Station  109  Group  13  and,  in  this  case, 
represents  a  set  of  twelve  days.  As  shown  in  Figiire  4.4,  most  days  in 
this  group  exceeded  the  oxidant  standard.  The  average  factor  scores 
for  the  first  three  principal  components  in  Group  109-13  were  316,  90 
and  -42.   These  three  numbers  appear  in  the  block  above  Group  13  in 
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Figiore  4.4.   Days  which  had  large  positive  values  for  the  first  princi- 
pal component,  generally  clustered  at  the  extreme  right  in  Figures  4.2 
through  4.5.   These  days  also  tended  to  exceed  the  oxidant  standard. 
Thus,  the  first  component,  which  accounts  for  over  50%  of  the  total 
variance,  also  tends  to  separate  high  oxidant  days  from  low  level  days. 

Figures  4.2  through  4.5  display  the  group  structure  for  each  day  repre- 
sented in  1974.   All  of  the  days  are  arranged  in  cluster  order  along 
the  abscissa.   The  dates  are  not  shown  as  their  number  is  too  large  to 
be  conveniently  displayed.   The  subgroup  numbers  appear  sequentially 
below  the  horizontal  axis.   The  rectangle  above  each  group  number  indi- 
cates both  the  number  of  days  in  the  group  and  the  relative  scatter 
among  members  of  that  group. 

Returning  to  Group  109-13  in  Figure  4.4,  the  rectangle  above  Group  13 
has  relative  scatter  equal  to  48.9%  of  the  total  scatter  (1.612X10^ (yg/ 
m3)2)  computed  for  all  269  days  in  Station  109  for  1974.   The  width  of 
the  rectangle  is  12/269  or  4.46%  of  all  days  represented.  The  horizon- 
tal line  above  Group  13  is  therefore  4.46%  of  the  length  of  the  entire 
horizontal  line.   Figiire  4.6  displays  the  clustering  of  the  twelve  days 
in  Group  109-13.   The  first  four  steps  for  this  group  combined  pairs  of 
days  ending  when  April  11  joined  with  April  21  at  a  relative  scatter  of 
8.4%.   These  were  steps  numbered  34,  50,  74  and  128  in  the  268  sequen- 
tial steps  required  to  cluster  Station  109.   At  Step  154,  the  fifth 
step  taken  for  Group  109-13,  April  27  and  May  13  were  combined  with 
April  20  and  April  28  to  form  a  new  group  at  a  relative  scatter  of  12.6%. 
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This  process  continued  and  at  Step  247,  Group  109-13  was  defined  to 
include  12  specific  days.  Note  that  we  can  easily  subdivide  this  group 
into  finer  subgroups,  which  are  more  similar  to  one  another,  or  add 
additional  groups,  such  as  Groups  14  and  15,  while  increasing  the  scat- 
ter. The  procedure  is  the  same  as  illustrated  by  example  in  Chapter 
III,  except  that  the  clustering  was  done  in  six  space  to  recover  at 
least  90%  of  the  original  information. 

Once  the  members  of  a  particular  group  are  defined,  the  eirithmetic  aver- 
age for  each  factor  can  be  computed.   This  is  useful  to  understand  how 
factors  on  the  average  account  for  differences  between  groups.  The 
first  three  factors  for  Group  109-13,  as  stated  previously,  are  316,  90 
and  -42  averaged  over  the  12  days  represented.   In  contrast.  Group 
109-10  representing  69  days  has  averaged  scores  of  -126,  14  and  -2.  As 
indicated  by  the  tick  marks  in  Figure  4.4,  most  of  the  Group  13  days 
exceeded  the  ozone  standard  while  none  of  the  Group  10  days  reached 
160  vg/m^.   Note  that  each  factor  is  zero  averaged  over  all  269  days. 
Distinct  separations  between  the  groups  displayed  in  Figures  4.2  through 
4.5  are  apparent  when  the  group-averaged  factor  scores  are  presented. 

Periods  having  similar  waveforms  lasting  a  few  days  or  longer  appear  as 
chronological  sequences  within  a  group.   Using  Group  109-13  once  again 
as  an  example,  all  except  two  of  the  days  in  this  group  occurred  in 
April.  As  shown  in  Figiire  4.6,  these  April  dates  were  10,  11,  18,  19, 
20,  21,  25,  26,  27  and  28.  Seven  of  these  ten  days  exceeded  159  ug/m"^. 
The  sequence  of  days  from  the  18th  through  the  28th  represents  a 
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FIGURE  4.6 

Clustering  Steps   for  the  Twelve  Days  in 
Group   109-13 
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quasi-stationary  period  at  Site  109  with  a  change  in  conditions  between 
the  21st  and  25th.   This  period  represents  a  sequential  set  of  days 
during  which  each  day  had  a  similar  waveform.   If  sufficient  influence 
parameters  are  available  with  fine  enough  temporal  resolution,  a  digital 
simulation  relating  observed  ozone  concentration  to  input  parameters  can 
be  developed  for  this  group.   Traditional  transfer  functions,  derived 
empirically,  can  be  used  to  develop  a  stochastic  model  representing 
conditions  on  April  13  through  April  28,  and  differences  in  these  influ- 
ence parameters  on  April  22,  23  and  24  could  provide  insight  into  causes  of 
elevated  oxidant  levels  for  Group  109-13. 

In  addition  to  temporal  stationarity,  spatial  homogeneity  is  highly 
desirable.   If  periods  having  similar  waveforms  at  all  spatial  locations 
were  defined,  mesoscale  influence  parauneters  could  be  used  to  construct 
the  model.   Otherwise  microscale  data,  collected  at  each  sampling  site, 
would  be  necessary.   Spatio-temporal  homogeneity  was  investigated  using 
the  cluster  algorithm.   A  four-space  grouping  based  on  the  average  fac- 
tor scores  for  each  subgroup  at  each  station  appears  in  Figure  4.7. 
The  complete  tree  structure  has  been  diagramed  with  the  sampling  site 
and  subgroup  number  at  the  left  margin.  A  total  of  66  station  groups 
were  analyzed.   This  dendogram  resolved  three  basic  group  types  denoted 
NE  (non-episode),  E  (episode)  and  SE  (super  episode).  Group  NE  con- 
tained 15,192  station  hours,  of  which  only  two  occurrences  exceeding 
159  ng/m^  were  found.   All  twelve  months  of  the  year  were  represented 
at  all  four  sampling  sites.   A  total  of  140  values  exceeding  159  ug/m^ 
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were  found  in  the  10,104  station  hours  represented  in  Group  E.  The 
2,544  station  hours  in  Group  SE  included  402  such  values.   The  SE,  E, 
and  NE  groupings  were  based  on  spatial  location  as  well  as  temporal 
waveform.   In  percent  of  total  hours  represented.  Group  SE  had  16%  over 
159  pg/m^  while  Groups  E  and  NE  had  1.4%  and  0.013%  respectively.   An 
example  of  a  spatially  homogeneous  day  is  April  19  which  was  an  SE  day 
at  all  four  sampling  stations.  As  shown  in  Figure  4.6,  April  19  is 
also  a  member  of  Group  109-13. 

To  further  investigate  spatial  homogeneity,  a  date  matching  routine  was 
developed.  Table  4.4  was  made  from  a  computer  output  for  Group  SE. 
All  dates  for  which  at  least  one  site  belonged  to  this  group  are  listed 
chronologically  at  the  left.   Under  each  of  the  four  stations,  the  group 
membership  is  summarized.   M  and  O  are  for  missing  and  omitted  data 
respectively.   In  all,  52  days  were  represented.   Only  five  of  these 
days  (denoted  *)  did  not  include  at  least  one  station-hour  exceeding 
159  yg/m^,  and  virtually  all  of  the  extensive  occurrences  of  elevated 
levels  were  included  in  this  group.   Spatial  homogeneity  within  Group 
SE  is  apparent  in  Table  4.4.  On  nine  days,  denoted  +  in  the  table,  all 
of  the  stations  represented  recorded  type  SE  days.  On  these  occasions, 
macroscale  phenomena  created  elevated  oxidant  levels  throughout  the 
region.   Presumably,  site  specific  differences  played  a  small  role  in 
oxidant  formation,  so  macroscale  influence  parameters  could  be  selected 
to  construct  simulation  models  representing  these  days.   Table  4.4  also 
shows  occasions  when  spatial  differences  were  significant.   For  example, 
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Data 

2-13 
3-10 
3-14 
3-19 
3-30 
3-31 

4-  a* 

4-10 

4-11* 

4-17 

4-18-F 

4-19+ 

4-20 

4-21 

4-24 

4-25 

4-26 

4-27 

4-28 

4-29+ 

4-30+ 

5-  1 
5-  7 
5-  8+ 

5-  9 
5-10 
S-13 
S-29 
5-31 

6-  5+ 
6-  6 
6-12 
6-13+ 
6-18 
6>1» 
6-22 
6-29 
7-lS 
7-16 
7-17 
7-23 
7-24 
7-29 
7-30f 
8-14 
»-10 
9-18 

10-  1* 
10-  2* 
10-  8* 
10-10 
10-19+ 


Days  Represented  In  Group  SE. 
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scanning  Table  4.4  for  type  NE  days,  one  finds  five  occasions  where  a 
super  episode  occiirred  at  at  least  one  site  while  a  low  ozone  day  occxir- 
red  at  another.   July  23  is  a  striking  example  when  persistant  elevated 
ozone  was  reported  at  all  three  Tampa  locations  while  the  St.  Petersburg 
site  remained  low.  Clearly,  there  are  occasions  when  site  specific 
influence  data  would  be  required  to  explain  spatial  differences.  How- 
ever, most  of  the  days  having  at  least  one  SE-type  station  day  also  had 
type  SE  or  type  E  conditions  at  the  other  sampling  sites.   The  degree 
of  spatial  homogeneity  exhibited  on  days  chosen  for  modeling  affects 
the  degree  of  spatial  resolution  required  in  the  influence  parameters. 

Coupling  spatial  homogeneity  with  temporal  homogeneity  provides  a  means 
of  selecting  periods  which  should  be  most  easily  modeled.   Intuitively, 
these  would  be  periods  where  the  entire  region  experienced  similar  air 
quality  which  lasted  a  relatively  long  time.  Returning  to  Table  4.4, 
a  scan  for  consecutive  dates  reveals  one  interval  lasting  eight  days 
from  April  24  through  May  1.   Other  intervals  in  Table  4.4  lasting  from 
two  to  five  days  can  also  be  found  which  represent  quasi-stationary 
high-oxidant  episodes.  Of  the  nine  days  previously  mentioned  which 
exhibited  SE  conditions  at  all  sites,  only  October  19  does  not  occur 
during  one  of  these  quasi-stationary  periods.   If  sufficient  influence 
parameter  data  were  available,  these  intervals  would  be  good  candidates 
for  simulation  modeling.  The  scope  of  this  dissertacion  does  not 
include  construction  of  digital  models,  since  the  primary  objective  is 
demonstrating  a  method  of  defining  quasi-stationary  intervals  for  which 
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stochastic  models  could  be  developed.   Consequently,  the  remainder  of 
this  chapter  concerns  statistical  and  meteorological  descriptions  which 
characterize  the  siibgroups  defined  through  pattern  recognition. 

Lognormal  parameters  were  calculated  for  each  station  day.  The  average 
geometric  mean  and  geometric  standard  deviations  were  also  calciilated. 
Group  membership  and  average  lognormal  parameters  appear  in  Table  4.5, 
including  the  number  of  station  days  represented.  The  overall  groups 
in  Table  4.5  refer  to  group  numbers  opposite  the  brackets  in  Figure  4.7. 
Membership  in  each  group  from  1  through  14  is  shown  in  both  Table  4,5 
and  Figure  4.7.   These  distributions  have  been  plotted  in  Figure  4.8. 
Individual  curves  can  be  identified  by  locating  the  left-most  line  at 
the  lower  right-hand  corner  of  each  graph.   The  subgroup  order  begins 
with  this  curve  and  lists  the  group  from  left  to  right.   Group  1  had 
slightly  greater  slope  than  Group  2,  but  a  very  similar  geometric  mean 
near  27  wg/m^.   Group  E  had  an  overall  geometric  mean  of  49  at  slope 
2.63.   Groups  3,  4  and  5  appear  together  with  an  average  geometric  mean 
around  38  ug/m-^.   Their  standard  deviations  ranged  from  2.77  to  3.98. 
Groups  6  and  7  were  close  together  and  averaged  63  pg/m-^.   These  two 
curves  had  the  lowest  slope  of  any  groups  indicating  elevated  levels 
throughout  the  day,  but  few  values  reaching  160  yg/m-^.   (Note:  The 
highest  in  24  values  occurs  with  a  frequency  of  2.8%)..   Group  8,  with  a  geo- 
metric mean  of  76  and  slope  of  2.1  was  more  closely  aligned  with  Groups 
6  and  7,  as  would  be  expected  from  the  dendogram  structxire  in  Figure 
4.6.   The  super  episode  group  had  an  averaged  geometric  mean  of  72.   All 
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TABLE  4.5 

Average  Lognormal  Parameters  for 
Sunmiary  Groups  and  Subgroup  Membership 


Group 

Group 

Station 

Average 

Average 

Mo. 

Members 

Days 

Geo  Mean 

Geo  Dgv. 

I 

63-  1 
94-  8 

STP-  1 
109-11 

63-  2 

STF-  6 

246 

26.8 

2.70 

2 

53-  3 
STP-  5 

63-10 
STP-  3 

94-  7 

STP-  4 

94-  9 

63-12 

109-10 

STP-  2 

94-10 

387 

27.3 

2.49 

3 

63-  4 
STP-10 

STP-14 

94-  4 

56 

25.6 

3.98 

4 

63-  5 

94-  6 

109-  5 

109-  9 

94-14 

STP-16 
109-  8 
STP-15 
STP-13 
STP-  9 

63-  8 
94-  5 
63-  6 
G3-  7 
94-13 

169 

39.6 

3,07 

5 

63-  9 

€3-11 

STP-  7 

109-  3 

STP-12 

STP-  0 

94-  1 

94-  2 

109-  2 
94-  3 

STP-11 
63-13 

23 

51.9 

2.77 

6 

63-  9 

STP-12 

109-  2 

125 

60.4 

1.77 

7 

94-12 

109-  4 

109-  1 

30 

71.3 

1.71 

8 

63-14 

109-12 

21 

75.9 

2.11 

9 

63-15 

109-15 

63-18 

109-14 
109-  6 
STP-19 

94-11 
109-  7 

63 

51. S 

3.61 

10 

63-16 

STP-17 

109-13 

34 

102.4 

1.64 

11 

63-17 

STP-18 

9 

100.3 

2.19 

12  (NE) 

1 

2 

633 

27.2 

2.57 

13  (E) 

3 
6 

4 
7 

5 

8 

424 

48.9 

2.63 

14  (SE) 


10 


U 


106 


72.0 


2.86 


Notes:  1)  Group  numbers  1-14  are  shovm  in  Figure  4.6. 

2)  Group  members  show  station  and  subgroup  number  represented. 

3)  Average  geoinean  is  the  average  geometric  mean  for  all 
station  days  in  the  group 
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three  of  the  subgroups  which  made  up  Group  SE  predict  daily  contraven- 
tions of  the  oxidant  standard.   Group  9  had  a  geometric  mean  of  only 
52,  but  a  slope  of  3.6.  This  indicates  relatively  low  levels  during 
certain  times  of  the  day  with  very  high  excursions  at  other  times. 
Group  9  consisted  of  63  days  and  represented  nearly  60%  of  all  of  the 
SE  type  cases.  Group  11  included  a  total  of  nine  days  and  was  limited 
to  station  63  and  the  site  in  St.  Petersburg.  With  a  geometric  mean  of 
100  at  a  slope  of  2.19,  all  of  the  members  of  this  group  would  also  be 
expected  to  exceed  the  air  quality  standard.   Group  10,  at  102  and  1.64 
respectively,  predicts  contraventions  of  160  ]ig/m^   on  the  average  but 
includes  some  days  which  statistically  might  not  exceed  this  limit. 

Figure  4.9  is  a  plot  of  the  mean  and  first  two  principal  components  for 
the  subgroups  listed  in  Figure  4.7.   A  cross  reference  table  is  provided 
as  a  guide  to  locate  groups  in  Figure  4.9.  All  groups  shown  in  Figure 
4.7  or  Figures  4.2  through  4.5  can  be  found  in  Table  4.6.   The  plots 
are  arranged  in  cluster  order  left  to  right  and  top  to  bottom.  The 
number  of  days  in  each  group  appears  at  the  upper  left  corner  in  each 
figure.   The  number  in  the  upper  right  corner  is  the  total  variance  of 
all  days  in  the  group  about  the  mean  vector.  The  number  at  the  left  of 
each  principal  component  is  the  percent  of  the  total  variance  explained 
by  that  eigenvector.  Figure  4.9  is  very  useful  for  visualizing  differ- 
ences between  observed  waveforms.  Averaging  smooths  abrupt  changes 
which  confound  the  analysis  of  individual  station  days.   Thus,  the  mean 
vector  for  each  siibgroup  is  easier  to  interpret  than  the  waveform  of  a 
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PIGUHE  4.9  b  Haan  and  Principal  Oomponenta  of  Subgroupa 
Typa  SB  Oaya  Condnued 
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naOBE  4.9  d     Haan  and  Principal  Components  at  Subgroups 
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FIGURE  4.9  •     Mean  and  Principal  Components  of  Subgroups 
Typs  NE  Days  Continued 
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fICURE  4.9  f  Mean  and  Principal  Conponunta  of  Subgroups 
Type  C  Cdy3 
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FIGURE  4.9  J     Haan  and  Principal  Conponents  of  Subgroups 
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FICUBI!  4.9  k     Hean  and  Principal  Canponents  ot  Subgroups 
Type  E  Days  Continued 
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specific  day.  Differences  between  individual  members  of  each  group  can 
be  visualized  in  terms  of  the  first  and  second  principal  components. 
Large  positive  or  negative  peaks  in  these  cxrcves  indicate  times  during 
the  day  when  differences  between  group  members  occurred.  These  compo- 
nents probably  have  physical  significance  in  terms  of  meteorological  or 
other  influence  parameters.   Many  of  the  components  display  a  periodic 
nature  having  24-hour,  12-hour  and  5-hour  cycles.   In  addition,  both 
mean  and  component  vectors  seem  to  be  repeated  among  various  subgroups. 
It  is  also  apparent  that  many  of  the  sv±)groups  exhibit  unusual  waveforms 
which  are  not  characteristic  of  the  pattern  expected  from  classical 
photochemical  considerations.  Figure  4.9  thus  forms  a  basis  for  fur- 
ther investigation  of  oxidant  phenomena.  A  choice  of  subgroups  to  model 
and  possible  characteristics  of  the  influence  parameters  become  much 
clearer  after  studying  Figiare  4.9  and  Tables  4.1  and  4.5. 

To  illustrate  the  use  of  Figure  4.9,  an  example  using  Group  109-13  is 
now  presented.  To  locate  this  group,  we  find  Station  109-13  as  the 
first  element  of  the  fifth  column  in  Table  4.6.   The  entry  opposite 
109-13  is  4.9b  which  is  page  71  and  we  find  Group  109-13  in  the  middle 
at  the  bottom  of  this  page.   The  mean  vector  was  computed  to  be  75  pg/m-^ 
from  midnight  until  about  4:00  AM  followed  by  a  drop  in  concentration 
to  50  yg/m^  about  5:30  AM.   The  ozone  level  then  climbs  linearly  to 
reach  a  maximum  of  160  ng/m^  arovmd  12:00  noon.   From  noon  through 
5:00  PM,  the  average  ozone  concentration  for  these  12  days  remained 
near  160  Mg/m-^  and  dropped  to  about  110  yg/m^  by  11:00  PM  in 
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approximately  linear  fashion.  As  shown  in  Figure  4.7,  Group  STP-17  is 
most  similar  to  Group  109-13  and  the  relative  scatter  was  about  2.5% 
when  these  two  groups  were  combined.  Since  these  graphs  were  plotted 
in  cluster  order.  Group  STP-17  appears  immediately  to  the  left  of  Group 
109-13  in  Figure  4.9b-   The  similarity  between  average  waveforms  is 
easily  recognized. 

The  first  and  second  principal  components  for  Groups  109-13  and  STP-17 
have  a  relatively  smooth  form.   This  indicates  that  differences  between 
individual  days  occur  over  a  span  of  a  few  hours  rather  than  during  a 
single  hour.  The  exception  is  the  peak  at  about  7:00  AM  in  the  first 
two  components  for  Group  STP-17.   These  peaks  tell  us  that  much  of  the 
variation  between  days  for  this  group  occurred  during  the  concentration 
drop  from  100  yg/m-^  to  50  \ig/m^   at  about  7:00  AM.   Thus,  some  days 
exhibited  less  change  in  concentration  at  7:00  than  the  average  day, 
while  other  days  had  an  even  sharper  change  than  the  mean  day.   However, 
over  all  24  hoiors,  both  of  these  groups  tended  to  have  either  higher  or 
lower  morning  values  than  the  mean  day.   This  is  evident  in  the  first 
component  which  represents  over  40%  of  the  variation  about  the  mean. 
A  gradual  change  in  these  weighting  functions  from  20  to  neeir  zero  oc- 
curs from  midnight  until  3:00  PM.   This  means  that  more  variation 
occurred  in  the  early  morning  concentrations  than  in  the  early  afternoon 
values . 

The  total  variances  about  the  mean  vector  over  all  24  hoxirs  were  11334 
and  12640  (yg/m^)^  for  Groups  109-13  and  STP-17  computed  over  12  and  16 
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days  respectively.  These  values  appear  at  the  top  of  each  graph  in 
Figure  4.9b.  These  low  values  confirm  the  relative  similarity  among 
days  in  these  two  groups.  These  two  groups  would  be  good  candidates 
for  simulation  modeling  of  certain  super  episode  days.  Similarly  Group 
94-11,  having  a  total  variance  of  24726  (yg/m3)2  for  22  days,  is  a 
group  for  which  models  could  be  constructed.   In  contrast.  Group  63-17 
(Figure  4.9b)  shows  that  some  days  tend  to  be  unique  with  large  changes 
in  concentration  over  short  time  periods.  Turning  to  members  of  the  NE 
group,  the  total  variance  about  the  mean  is  much  lower  since  days  tend 
to  be  very  similar  to  one  another.  These  days  would  be  easiest  to 
model,  but  are  also  of  less  interest  because  the  oxidant  standard  was 
rarely  exceeded.   Models  to  represent  these  days  may  be  useful  to  under- 
stand underlying  physiochemical  and  meteorological  conditions  which 
restrict  or  reduce  oxidant  generation.  Naturally,  a  large  collection 
of  very  similar  days  is  much  easier  to  model  than  a  few  unusual  days. 
Figure  4.9  reveals  those  groups  which  would  be  easiest  to  simulate  as 
well  as  graphically  presenting  the  basic  characteristics  of  all  of  the 
days  represented  in  1974. 

The  objective  in  Chapter  IV  is  to  illustrate  the  methods  presented  in 
Chapter  III  through  an  application  of  real  data.  Figure  4.9  provides 
an  effective  means  of  breaking  the  analysis  of  the  oxidant  problem  in 
South  Florida  into  a  collection  of  smaller  problems.   Groups  of  similar 
days  can  be  further  studied  using  a  large  number  of  other  mathematical 
techniques.   In  addition  to  studying  the  interaction  of  influence 
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parameters  within  a  group,  Figxire  4.9  provides  an  approach  to  studying 
differences  between  groups.   Contrasting  difference  in  influence  para- 
meters, using  powerful  statistical  tools  such  as  stepwise  discriminate 
analysis,  could  be  used  to  identify  those  parameters  having  the  greatest 
impact  on  oxidant  levels.   Once  these  are  identified,  the  time  evolution 
of  ozone  as  a  function  of  these  parameters  can  be  studied,  provided  the 
patterns  of  interest  have  been  observed  sufficiently  often.  The  tech- 
nique put  forth  in  this  dissertation  has  wide  application  in  other 
fields  as  well  as  in  air  pollution.   It  provides  a  systematic  method 
of  condensing  useful  information  and  coirparing  it.  A  better  under- 
standing of  the  problem  as  well  as  possibilities  for  potentially  suc- 
cessful approaches  to  further  analysis  results- 
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Table  4.6 

Cross  Reference  Summary 
Subgroups  from  Figiires  4.2  Through  4.5 


Group 

Figure 

Group 

Figure 

Group 

Figure 

Name 

No. 

Neune 

No. 

Ncime 

No. 

63-  1 

4.9c 

94-   5 

4.9g 

109-13 

4.9b 

63-  2 

4.9c 

94-  6 

4.9g 

109-14 

4.9a 

63-   3 

4.9d 

94-   7 

4.9d 

109-15 

4.9a 

63-  4 

4.9f 

94-   8 

4.9c 

STP-  1 

4.9c 

63-   5 

4.9f 

94-   9 

4.9e 

STP-   2 

4.9d 

63-  6 

4.9h 

94-10 

4.9e 

STP-   3 

4.9e 

63-  7 

4.9h 

94-11 

4.9a 

STP-   4 

4.9d 

63-  8 

4.9g 

94-12 

4.9k 

STP-   5 

4.9d 

63-  9 

4.9i 

94-13 

4.9i 

STP-   6 

4.9c 

63-10 

4.9e 

94-14 

4.9h 

STP-   7 

4.9j 

63-11 

4.9i 

109-    1 

4.9k 

STP-   8 

4.9i 

63-12 

4.9e 

109-   2 

4.9i 

STP-   9 

4.9h 

63-13 

4.9k 

109-    3 

4.9j 

STP- 10 

4.9f 

63-14 

4.9k 

109-   4 

4.9k 

STP- 11 

4.9j 

63-15 

4.9a 

109-   5 

4.9g 

STP- 12 

4.9i 

63-16 

4.9b 

109-   6 

4.9a 

STP- 13 

4.9h 

63-17 

4.9b 

109-    7 

4.9a 

STP-14 

4.9f 

63-18 

4.9b 

109-   8 

4.9g 

STP-15 

4.9g 

94-  1 

4.9j 

109-   9 

4.9h 

STP-16 

4.9f 

94-   2 

4.9j 

109-10 

4.9d 

STP-17 

4.9b 

94-   3 

4.9j 

109-11 

4.9c 

STP-18 

none 

94-  4 

4.9f 

109-12 

4.9k 

STP-19 

4.9b 

Note:  1)   63-1  means  Station  63  Subgroup  No.  1. 
2)  4.9a  to  4.9k  refer  to  pages  70  to  80. 


CHAPTER  V 
SUMMARY  AND  CONCLUSIONS 

Atmospheric  processes  are  inherently  random  and  they  change  randomly 
over  time.  Statisticians  call  this  kind  of  behavior  a  non-stationary 
stochastic  system.   Modeling  air  pollution  is  difficiilt  because  the 
response  to  influence  parameters  changes  with  time.  Classical  a  priori 
models  treat  this  problem  by  assigning  stability  types  and  making  assump- 
tions which  allow  approximate  solutions  to  the  pollutant  transport 
equation.   Stochastic  models  are  better  suited  to  simulating  air  quality 
because  they  represent  empirical  relationships  between  observed  para- 
meters emphasizing  their  random  nature.   These  models  can  provide  the 
greatly  needed  means  to  better  understand  and  control  air  quality. 
Atmospheric  siarveillance  is  becoming  more  intensive  and  more  sophis- 
ticated.  Naturally,  stochastic  models  for  air  pollution  will  become 
more  popular  when  highly  reliable  and  extensive  sets  of  data  become 
available.   However,  the  non-stationary  behavior  of  these  systems  must 
be  recognized  and  treated  in  order  to  develop  meaningful  stochastic 
models . 

This  dissertation  described  some  limitations  to  modeling  air  pollution 
and  sets  forth  a  method  for  segregating  data  having  similar  attributes. 
Although  a  number  of  methods  are  possible,  the  most  importcuit  principle 
is  to  recognize  the  need  for  many  models.  A  single  model  to  represent 
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all  of  the  observed  conditions  is  not  possible.  Therefore,  the  first 
step  in  building  stochastic  air  quality  models  is  to  find  sets  of  condi- 
tions which  could  be  modeled.  TO  qualify,  the  same  conditions  must  have 
occurred  often  enough  so  that  repeated  experiments  in  the  set  are  not 
too  different  from  one  another.  Otherwise,  the  "system"  being  modeled 
may  actually  represent  multiple  systems.  Note  that  rapidly  changing 
conditions  require  frequent  sampling  and  that  significant  spatial  dif- 
ferences require  a  dense  sampling  network.   Hourly  averages  at  four 
sampling  sites  may  be  sufficient  to  represent  some  conditions  but  gros- 
sly inadequate  for  others  if  very  rapid  changes  occurred.  After  choosing 
the  appropriate  conditions,  reliable  influence  data  can  be  analyzed  to 
construct  a  model.  Similarly,  models  representing  particular  conditions 
could  be  formulated  to  include  most  of  the  obseirved  data. 

The  approach  to  treating  non- static nary  data  presented  in  this  work 
utilized  an  assumption  so  that  multivariate  methods  applied.  Each  hour 
in  a  day  was  considered  as  a  distinct  variable  since  ozone  is  related 
to  the  intensity  of  incident  radiation.   This  assiamption  was  not  neces- 
sary but  greatly  simplified  formulating  the  pattern  recognition  problem. 
A  generalized  grouping  method  could  be  developed  to  seek  inteirvals  having 
similar  system  response.   The  length  of  these  intervals  does  not  need  to 
be  constant,  and  could  be  initialized  by  passage  of  a  front  for  example. 
However,  to  develop  stochastic  air  pollution  models,  one  must  recognize 
the  need  for  repeated  observations  of  similar  conditions  and  methods  to 
determine  when  these  have  occarred. 
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Atmospheric  modeling  is  a  process  of  simulating  a  waveform  and  comparing 
it  with  an  observed  waveform  which  represents  the  same  condition.  The 
simulated  results  are  subtracted  from  the  observed  results  to  yield  a 
new  waveform  called  the  residuals.  If  the  residuals  are  small  compared 
with  the  original  data,  the  model  represents  those  conditions  well. 
Further  treating  the  residuals  can  lead  to  better  models  until  ulti- 
mately what  is  left  is  that  purely  random  component  known  as  white  noise. 
The  models  used  for  each  successive  step  can  be  deterministic,  as  in 
classical  modeling,  or  stochastic.  Either  type  of  model  represents  a 
relationship  between  inputs  and  outputs.   Both  deterministic  and  sto- 
chastic models  have  characteristic  frequency  response  curves  relating 
these  inputs  and  outputs.   Classical  models  develop  this  relationship 
from  physical  principles  while  stochastic  models  rely  on  empirical 
relationships.  But  the  final  test  of  each  model  is  its  ability  to 
reconstruct  the  observed  data  and  yield  a  purely  random  residual.  Models 
fail  when  the  influence  data  are  inadequate  and  when  functional  coeffi- 
cients are  in  error.   It  really  doesn't  matter  how  the  model  was  devel- 
oped, so  long  as  the  correct  influence  parameters  and  coefficients  have 
been  determined  and  can  be  relied  upon  to  correctly  simulate  the  condi- 
tions represented.   Pattern  recognition  used  on  observed  data  can  provide 
test  data  to  compare  with  a  priori  models.  A  priori  models  can  be  used 
to  suggest  influence  parameters  and  to  maJce  assumptions  in  applying 
pattern  recognition  tools.  Both  techniques  should  be  used  together. 
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A  pattern  recognition  approach  to  dealing  with  the  non-stationary 
waveforms  found  in  air  pollution  was  presented  for  ozone  data  reported 
at  four  sampling  sites  in  Tampa  and  Saint  Petersburg,  Florida.   These 
data  were  selected  because  the  oxidant  problem  is  a  national  concern 
and  accurate  data  were  available.  Each  day  was  reduced  from  24  hourly 
observations  to  a  set  of  six  values  while  retaining  over  90%  of  the 
available  information.   These  six  values  represented  proportionate 
contributions  of  six  independent  pseudo  variables  each  of  which  was 
made  up  from  the  original  24  hours.   With  proper  scaling,  the  euclidean 
metric  was  used  to  measure  distance  between  each  day  and  the  most  simi- 
lar days  were  combined  into  common  groups.  Each  station  was  analyzed 
separately  and  the  results  were  displayed  graphically  as  a  dendogram. 
Days  which  experienced  elevated  oxidant  levels  tended  to  be  closely 
associated  in  the  dendogram.  Similarly,  days  with  low  ozone  levels 
were  fovmd  together.  The  average  factor  scores  from  each  group  were 
used  to  determine  spatially  similar  groups.   This  analysis  resulted  in 
the  super  episode  (SE)  ,  episode  (E)  ,  eind  non-episode  (NE)  classifica- 
tion which  was  used  to  contrast  meteorological  conditions.  As  des- 
cribed in  Appendix  III,  intense  solar  insolation  following  passage  of 
a  cold  front  and  the  associated  high  pressure  system,  accounts  for 
many  super  episode  days.   A  clear  explanation  of  differences  between 
E  and  NE  type  days  was  not  possible  from  the  limited  meteorological 
data  available.   The  mean  and  first  two  principal  components  for  each 
group  at  each  station  were  plotted  in  cluster  order.  This  graphical 
display  demonstrated  the  existance  of  a  number  of  distinct  types  of 
days  which  appear  to  require  basically  different  models. 
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Average  lognormal  parameters  for  groups  of  days  were  computed  and  the 
distributions  were  plotted.  These  also  demonstrated  distinct  differ- 
ences between  types  of  days  and  showed  that  there  is  no  single  under- 
lying distribution  representing  all  observed  conditions. 

In  many  cases  ozone  behavior  was  found  to  be  unusual  with  respect  to 
classical  photochemical  oxidant  theory. 

Pattern  recognition  provides  a  method  of  hcuidling  the  randomly-changing 
non-stationciry  conditions  encountered  in  atmospheric  studies.   This 
application  clearly  illustrated  the  usefullness  of  pattern  recognition 
as  a  diagnostic  tool  in  air  pollution  research. 

A  number  of  significant  conclusions  can  be  drawn  from  the  results  of 
this  work.   All  of  them  focus  on  the  problem  of  stationary.   Fitting 
confidence  bounds  on  the  population  mecui  or  maximum  and  computing  order 
statistics  from  a  sample  require  a  single  stationary  distribution.  It 
is  very  apparent  that  the  density  function  of  air  quality  data  changes 
with  time.   Consequently,  predicting  the  probability  that  a  prescribed 
level  was  exceeded  or  will  be  exceeded  in  the  future  can  not  be  done 
with  certainty.   Influence  parameters  can  be  identified  and  functional 
relationships  can  be  developed,  but  there  is  no  guarantee  that  these 
same  influence  parameters  will  be  important  under  another  set  of  con- 
ditions.  It  is  also  possible  that  influence  parameters  which  are  con- 
sidered to  be  insignificant  may  operate  under  some  circumstances  to 
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exercise  a  predominant  influence  on  air  quality.  Building  models  to 
describe  air  pollution,  given  this  lack  of  constraints,  is  indeed  com- 
plicated, and  a  single  model  capable  of  including  all  of  these  possible 
conditions  is  difficult  to  imagine.  Classifying  and  analyzing  observed 
air  quality  provides  an  alternative  approach  which  divides  the  problem 
into  managable  parts.   If  the  system  changed  in  a  gradual  predictable 
way,  forecasting  could  be  accomplished  using  adaptive  filters  with  coef- 
ficients that  change  predictably.   Unfortiinately,  conditions  in  the 
atmosphere  can  change  abruptly  to  require  a  new  set  of  influence  para- 
meters and  different  functional  relationships.  But  a  comprehensive 
history  of  repeated  events  can  be  used  to  determine  which  model  is 
appropriate.  Real-time  analysis  of  atmospheric  processes  can  be  devel- 
oped after  extensive  collection  and  analysis  of  data  have  transpired. 
Classifying  conditions  and  rearranging  them  into  chronological  order 
produces  sequences  of  events.   There  are  certain  patterns  to  these  se- 
quences which  can  be  used  to  predict  when  an  air  pollution  episode  is 
likely  to  occur.  Mathematical  models  to  simulate  each  kind  of  day  cuid 
other  models  which  predict  which  type  of  model  is  most  probable  are  the 
basis  for  real-time  prediction  and  stochastic  simulation  of  air  quality 
data. 


APPENDIX  I 
AIR  POLLUTION  STATISTICS 

Many  researchers  interested  in  air  pollution  do  not  have  a  strong  back- 
ground in  statistics.  This  section  is  designed  to  review  some  basic 
concepts  with  particular  emphasis  on  popular  distributions  used  to 
describe  air  pollution.  For  a  detailed  comprehensive  treatment,  the 
reader  is  referred  to  Aitcheson  and  Brown  (1) ,  Arnold  (2) ,  Doob  (10) 
and  Parzen  (29) . 

A  random  variable  is  a  function  whose  values  depend  on  the  outcome 
of  a  chcince  event.   If  we  denote  a  random  variable  x  and  observed  values 
?,  a  random  process  is  intuitively  a  set  of  random  variables  indexed 
to  include  the  notion  of  time.  A  particular  set  of  obseirvations  .-., 
C(-l)/  5(0),  5(1),...  is  called  a  sample  function  of  the  stochastic 
process.  The  random  variable  x  can  be  explicitly  defined  in  terms 
of  a  probability  density  function  (PDF)  and  a  cumulative  distribution 
function  (CDF).  A  PDF  (f(5))  is  a  weighting  function  which  specifies 
the  frequency  of  occurrence  of  any  real  constant  E,   over  all  possible 
values.  The  essential  property  is  that 

J_"f(C)d?=1.0. 
A  CDF  (Fjj(5))  is  the  probability  that  x  is  less  than  or  equal  to  ^. 
Thus  Fjj(5)  =  Pr(x^5)  =  /«fx(y)dy.  AI.l 

The  expected  value  of  a  non-random  function  g(x)  of  a  random  variable 
X  is  defined  as 

<g(x)>  =  E[g(x)]  =/_Ig(5)  fx  (C)  dC.  AI.2 
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It  is  also  knovm  as  the  mathematical  expectation,  ensemble  average, 

statistical  average  or  mean.  A  random  process,  explicitly  stated, 

is  the  probability  of  simultaneous  occurrence  of  n  events 

Pr{x(ti)<5(l),...,x(tn)<?(t„))  for  (tifit2. .  .f^tn)  -  Ai.3 

This  is  the  joint  CDF  of  the  random  variables  x  (ti),...,  x(tj^).  An 

orthogonal  random  process  has  the  property 

<(x(ti)  +  x(t2)  +  x(tn)...)2>=  <x2(ti)>  +  <x2(t2)>...+<x2(tn)>,  AI.4 

For   vector- valued  random  variables  X  with  components  x,,...,x  ,  the 

covariance  matrix  about  the  mean  is  defined  to  be 

[covl  =  <(x-<x>)  (xT  -  <xT>)>.  AI.5 

If  the  covariance  matrix  is  a  diagonal  matrix  (all  off -diagonal  elements 

=  0.0)  ,  the  random  variables  are  orthogonal.  The  Gaussian,  or  normal, 

PDF  for  a  random  Veiriable  is 


:-  =  ±    -m  1 


fx(£)  =  — T-  expl   |^:r-|   I  AI.6 

where 

m  =  <x>  is  the  mean,  and 


a   =/<(x  -  <x>)2>  is  the  standard  deviation. 
For  a  vector-valued  random  variable,  this  PDF  is  written 

^^^-^  "  (2Tr)n/2|covh   ^^^"''^^  "  <^»^M"^<=  "  <^>>         ^^'^ 
where 

|cov|  =  the  determinant  of  the  covariance  matrix,  and 

H  =  a  vector  of  simultaneously  observed  values  E,^. 

With  the  understanding  that  the  reindom  variable  x  has  infinitely  many 

realizations  in  ^Ct^),  we  now  allow  f(x)  in  place  of  fx(C)  to  simplify 

notation. 
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The  normal  probability  law  can  be  simply  stated  by  defining  the  standard 

normal  deviate  (z) .  For  a  zero-mean  unit- variance  Gaussian  variable, 

the  CDF  is 

F(z)  =/4f  {x)dx  =  -^  (j   e  -'s^^  dx 

•  2  IT  ^  oo 

F(o)  =  .50  indicates  the  mean.  AI.8 

F(-l)=  F(l)  =  .3413  (one  standard  deviation) 

Using  the  standard  normal  deviate,  any  Gaussian  variable  can  be  repre- 
sented by  the  equation  of  a  line: 

X  =  ra  +  za  .  AI.9 

Similarly,  the  lognormal  distribution  can  be  defined  by  x  =  ma 

where  mg  and  Og   are  the  geometric  mean  and  geometric  standard  deviation. 

Taking  the  natural  log,  we  obtain 

In  X  =  In  m™  +  z  In  Og  .  AI.IO 

These  equations  are  the  basis  for  normal  and  lognormal  probability 

graphs.  The  probability  cixis  is  linear  in  z  with  the  origin  at  f  =  50%. 

Figure  AI.l  is  a  graph  of  a  lognormal  distribution  with  a  geometric 

mean  of  100  and  a  geometric  star.dard  deviation  of  2.0. 

The  moments  of  a  distribution  are  frequently  used  to  classify  probabil- 
ity laws  and  to  estimate  their  parameters.   The  n*^  moment  about  zero 
is  defined  as  <g  (x)  >  where  g(x)=x'^.  Similarly,  the  n^  central  moment 
is  <  (x  -  <x>)">.  The  moment  generating  function  4'(t)  is  defined  as 
<e^^>  since  the  n^  moment  is 

4'''(s)|=  g^^  =<xnesx>  =  AI.IO 

s=o  <x"> I s=o . 

The  moment  generating  function  can  be  found  by  integration  for  a  given 
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axvsa  6oT  -  ^a/6it  uoT^aJ^uasuoo 
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PDF.  For  example,  using  the  normal  PDF,  Parzen  (29p.l60)  shows 

¥(3)  =  expCms  H-  a^s2/2)  .  AI.ll 

■Ehus 

<x>  =  1''(o)  =  e°(m  +  a2(o))  =  m 
<x-in>^  =  o  for  n  odd  and  AI.12 

<x-ni>'^  =  l'3-5--- (n-l)a'^  for  even  n. 

Pearson  (31)  used  shewness  (3  )  and  kurtosis  ($2)  to  graphically  repre- 
sent every  theoretically  possible  PDF.  Figure  AI.2  is  the  Qj.'    32  Pl^J^^ 
for  Pearson's  system  giving  each  distribution  type  in  terms  of  central 
moments.  These  coeficients  are  defined  to  be 

&l  =   <(x-m)3>2/<(x-m)2>3  and  AI.13 

B2  =  <(x-m)'*>/<(x-m)2>2  , 

The  lognormal  family  is  represented  by  a  line  which  intersects  the  3- 
axis  at  the  point  which  represents  the  normal  family.   The  gamma  line 
also  intersects  at  this  point.  Thus,  any  normally  distributed  random 
variable  may  also  be  represented  by  a  lognormal  or  gamma  distribution. 
In  this  case,  all  three  distributions  will  appear  as  a  straight  line 
on  log  probability  paper.  The  normal  and  log  normal  distributions 
are  related  by  m  =  n^exp  ^(InOg)   and 

7g) 

The  nonlinear  appearance  of  some  distributions  on  probability  paper 
corresponds  to  "U"  and  "J"  shaped  regions.  As  illustrated  by  Pearson, 
the  moments  of  a  distribution  implicitly  define  the  probability  law. 
A  number  of  methods  are  available  to  estimate  the  form  and  parameters 
of  a  distribution  given  a  sample  of  observed  values.   A  simple  tabulating 


AI.14 
a   =  (m2(exp(lnaq)2)-l))^. 
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of  rank  order  versus  magnitude  provides  range,  median  and  an  intuitive 

feel.  Histograms  (sample  frequency  of  occurrence)  give  a  general  des- 
cription and  suggest  possible  families.  Data  shewness  and  kurtosis 
coeficients  fix  the  type  in  the  Pearson  plane.  Once  the  general  form 
is  established,  parameter  estimate^  can  be  computed  using  the  method 
of  maximum  likelihood,  the  method  of  moments  or  other  techniques. 
Maximum  likelihood,  which  is  usually  the  most  reliable  method,  requires 
maximizing  the  likelihood  function  with  respect  to  the  distribution 
parameters.  Since  the  lognormal  distribution  occurs  often  in  air  pol- 
lution, an  appropriate  example  is  the  lognormal  likelihood  fvinction, 

i=l 
l(U,a)  is  the  probability  that  any  pair  of  y  and  a   are  the  correct 
estimates  based  on  the  sample  values  (x  ) .  Setting  the  partial  deriva- 
tives of  l(y,cj)  with  respect  to  u  and  a   equal  to  zero,  yields  a  system 
of  two  equations  in  two  unknowns.  The  lognormal  maximum  likelihood 

estimates  for  y  and  a   are 

n 

1  r 
V  =  —  -t-,   In  Xi  and 
n  1=1     1 

^  AI.16 

a2  =  1  S  (In  Xi-y)  . 
n  i=l 

This  is  equivalent  to  maximizing  the  normal-law  maximum  likelihood 
function  after  log-transforming  the  data.  When  the  likelihood  function 
cannot  be  maximized  analytically,  numerical  methods  such  as  Newton- 
Raphson  are  employed. 
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The  method  of  moments  uses  the  sample  central  moments  to  directly  com- 
pute distribution  parameters.  Unfortvinately,  estimates  using  this  method 
are  very  sensitive  to  differences  in  the  extreme  values.  Consequently, 
inconsistent  results  cam  occur  as  in  Lynn  (20) ,  where  the  two-parameter 
lognormal  distribution  performs  better  than  more  flexible  three  and  four- 
parameter  distributions . 

Larsen  (19)  suggests  a  graphic  technique  to  estimate  lognormal  y  ^^^  ^» 
The  observed  cumulative  frequency  is  calculated  based  on  rank  order 
using 

F  =  ^  "  •"*  ,  r<n/2 

n 

F  =  ^  ~  -^  ,  r<n/2  and  AI.17 

n 

F  =  .50,  r  =  (n+l)/2  if  n  is  odd 
where  r  is  the  rank  order  (highest  =  1,  lowest  =  n)  and  n  is  the  number 
of  observations. 

Plotting  log  of  concentration  versus  cumulative  frequency  on  log  pro- 
bability paper  results  in  a  distribution  graph.   The  geometric  mean 
and  deviation,  as  well  as  the  relative  goodness  of  fit,  can  then  be 
estimated.  An  extension  of  this  method  uses  an  empirical  relationship 
between  the  cumulative  frequency  and  the  standard  normal  deviate. 
The  relationship  is 

Z  =  hP^o  -^  ^1^  -^  ^2t2    ")  ^  ^  AI.18 


where 


ll  H-  d^t  +  d2t2 


+  d3t3j 


r,  100 

t  =  In  (1/f2) 

Cq  =   2.515517  Ci  =  0.802853  C2  =  0.010328 

di  =  1.432788  d2  =  0.189269  d3  =  0.001308 

-4 
and     e  =  4.5x10   =  estimation  error 


This  procedure  is  especially  useful  for  dealing  with  a  censored  distri- 
bution, as  in  air  pollution  where  80%  or  90%  of  the  measurements  may 
be  lower  than  a  threshold  value.  Linear  regression  of  ln(x),  for  values 
exceeding  this  threshold,  versus  z  can  provide  useful  estimates  where 
the  usual  maximum  likelihood  method  is  not  appropriate. 

Having  selected  a  distribution  and  fit  parameters,  tests  of  goodness 
of  fit  may  be  conducted.  Aitchison  and  Brown  (  1  )  give  methods  for 
the  lognormal;  the  sum  of  absolute  deviations  can  be  used;  or  the  non- 
parametric  Kolmogoroff-Smirnov  test  may  be  appropriate.  A  useful  tech- 
nique when  dealing  with  only  a  few  cases  is  to  plot  the  sample  distri- 
bution.  The  number  of  consecutive  observations  less  than  the  predicted 
values  and  the  number  exceeding  predicted  values  ("runs")  often  indicate 
systematic  deficiencies  in  the  selected  distribution.   If  the  chosen 
distribution  appears  to  be  inadequate,  another  family  can  be  selected 
based  on  the  results  of  goodness  of  fit  tests. 

Table  AI.l  lists  density  functions  for  distributions  encountered  in 
air  pollution.  The  lognormal  is  used  most  frequently.  The  flexible 
nat;ire  and  convenient  closed  form  of  the  kappa  make  it  an  attractive 
analytical  tool.   These  two  distributions  will  therefore  be  described 
in  greater  detail  to  illustrate  their  special  properties. 


TABLE  AI.l 
Probability  Density  Functions 
Commonly  Used  in  flir  Pollution  Modeling 
Weibull 

f„(x)  =  kxinexp  (}-kx°+V(m+lQ 

k  =  scale  parameter,   m  =  shape  parcuneter 
Exponential    (Weibull  with  m  =  0) 

fe(x)   =  k  exp  Q-k^    or  ke~^^ 
k  =  scale 
Rayleigh   (Weibull  with  m  =  1) 

fR(x)  =  kx  exp  ^-kx2/2] 
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k  =  scale  -  single  parameter  distribution 


Gamma 


ag   r(")  L.        J 
Qt  =  shape  parameter,   0  =  scale  parameter 

r(a)  =  Jo^"^  x^^'^^^dx  -  gamma  function 
Beta  (Pearson  I) 

f,(x)  -r(P)r(q),(^-A)°"(B--x)' 

^^^     r(p+q)    (B-A)P^-l 


|ni2 


A  =  scale  parameter   p  =  shape  pcureimeter   m^  =  p-1 
B  =  scale  parameter   q  =  shape  parameter   m2  =  q-1 


Lognormal 


ol  =   shape  parameter  =  Inag 
m"  =  scale  parameter  =  liuug 


Kappa 


a,9  =  shape  parameters,  3  =  scale  parameter 


A1.19 


AI.20 


AI.21 


AI.22 


AI.23 


AI.24 


AI.25 
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Larsen's  model  (  I9p.  35)  clearly  illustrates  the  concept  of  averaging 

time  in  air  pollution.  The  essential  element  is  an  empirical  formula 
which  relates  lognormal  parameters  for  measured  concentrations  to 
lognormal  parameters  for  any  other  averaging  time.  A  continuous  trace 
of  pollutant  concentration  can  be  integrated  (or  averaged)  over  a  15- 
minute,  1-hour,  24-hour  or  other  time  intervals.   Integrating  samplers, 
such  as  hi-vols  and  gas  bubblers,  provide  a  single  observation  repre- 
senting time  averaged  concentration.  The  problem  is  analogous  to 
analyzing  composit  samples  from  groves,  sections,  or  individual  trees. 
This  averaging  process  reduces  variance. 

Larsen  hypothesized  a  lognormal  distribution  for  all  averaging  times 
with  a  standard  geometric  deviation  (a  ) ,  which  decreases  as  the  aver- 
aging interval  increases.   For  the  longest  possible  averaging  time,  o^ 
becomes  unity,  indicating  a  constant  distribution.  Note  that  the  equi- 
valent normal  distribution  under  this  condition  has  a  standard  deviation 
of  zero  and  that  the  arithmetic  and  geometric  means  are  equal.  Figure 
AI.3  illustrates  Larsen's  model.  Empirically,  he  discovered  that  a 
plot  of  lognormal  median  concentration  versus  lognormal  averaging  time 
yields  a  straight  line.  This  line  passes  through  the  arithmetic  mean 
and  has  slope 

s  =  (In  C2-  In  Cj^)/(lnt2-  Int^^) .  AI.26 

When  t2  is  the  longest  averaging  time,  C2  is  the  arithmetic  mean. 

Thus 

s  =  ln(m/mg(ti))/ln(t2/ti) .  AI,27 

The  arithmetic  and  geometric  means  are  related  by 

ln(  m/mg)  =  .5(lnCg)2.  AI.28 
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If  the  longest  average    (t2)    is  denoted  t^^f^,  we  obtain 

s=   .5(lnag(ti))2/ln(ttot/ti)   =  •5(ln(ag(t2) )  Vln(ttot/t2)  •  ^'^^ 

The  equation 

C7g(ti)    =    [agCtj)]'',  AI.30 

where 

V  =  in  (ttot/ti)An(ttot/t2)' 
thus  follows.  Similarily, 

ing{t2)  =ing(ti)[exp  (.5  (1-v)  [  IncTg(tj^)  2] )].  ai.31 

Since  air  quality  standards  have  been  promulgated  for  periods  less 

than  or  equal  to  one  year,  t^  ^  is  usually  chosen  to  be  8760  hours. 

tot 

For  the  distribution  in  Figiore  AI.l,  the  relation 

In(ing)  =  0.0407140  (ln(t))  +  4.47577  AI.32 

is  derived  from  the  above  equations.  The  maximum  expected  value  for 
each  averaging  time  is  found  from 

^MAX  =  mgOg^f^AX  AI.33 

using  mg  and  Og   from  these  same  equations.   For  sample  times  less  than 
1  month. 

In  cj^jyj  =  -.34(lnt)  +  7.75  AI.34 

applies.  Thus  short  averaging  times  yield  some  very  high  observations 
and  many  neax  zero  observations  resulting  in  a  much  larger  variance 
than  occurs  for  longer  averaging  times. 

A  basic  principle  of  modern  statistics  is  known  as  the  central  limit 
theorem.  No  matter  what  the  distribution  of  the  original  variate,  this 
theorem  requires  that  the  sum  of  independent  stochastic  variates  be 
normally  distributed  for  large  numbers  of  samples.   The  lognormal 
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family  offers  an  apparent  contradiction  of  the  central  limit  theorem 

as  pointed  out  by  Marcus  in  Kornreich  (22p7-l)  and  by  Mitchell  (24) . 
Thus,  Lars  n's  hypothesys  does  not  adhere  to  this  theorem.  This 
phenomenon  can  be  partly  explained  in  terms  of  stationarity.   The 
central  limit  theorem  allows  very  general  conditions,  but  assumes  the 
variate  comes  from  a  single  distribution.  Air  pollution  measurements 
are  taken  from  distributions  which  are  constantly  changing  with  time. 
The  actual  form  of  the  distribution  at  any  particular  moment  depends 
on  meteorology  and  the  sovirce  to  receptor  relationship.   Pollack  (33) 
used  the  classical  Gaussian  model  to  investigate  conditions  which  would 
generate  a  lognormal  distribution.  He  concluded  that  receptors  affected 
by  many  sources  under  moderate  or  higher  wind  speeds  would  be  lognormal. 
However,  receptors  affected  by  isolated  point  sources  are  expected  to 
yield  exponential  or  Rayleigh  distributions.   Under  calm  conditions 
or  for  low  wind  speeds,  the  form  of  the  distribution  is  unknown.   Pollack 
also  demonstrated  the  similarity  between  the  distributions  commonly 
applied  to  air  pollution  data  in  support  of  the  lognormal  assumption. 
Bencala  and  Seinfeld  (3)  reached  basically  the  same  conclusions.   The 
distribution  followed  by  variates  which  are  driven  by  the  intricate  ephe- 
meral processes  encountered  in  air  pollution  cannot  be  precisely  known. 
Thus  the  lognoinnal  model  is  a  practical-  alternative  in  spite  of  defici- 
encies . 

An  alternative  distribution,  used  to  describe  meteorological  phenomena, 
is  the  3-parameter  kappa  family  introduced  by  Mielke  and  Johnson  (23) . 
There  is  no  theoretical  basis  for  the  application  of  this  distribution, 
but  its  closed  form  and  flexibility  are  very  attractive  for  air  pollution 
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research.  The  ability  of  the  kappa-3  to  conform  to  a  nearly  straight 

line  analogous  to  the  lognonnal,  or  to  wiggle  as  necessary,  will  appeal 
to  cinyone  who  has  plotted  a  probability  graph  for  air  pollution  measure- 
ments. The  convenient  closed  form  for  both  the  density  fionction  and 
the  distribution  function  are  also  very  attractive  for  computing  the 
order  statistics. 

The  analysis  of  eiir  quality  data  attempts  to  answer  two  questions: 
what  is  the  probability  that  values  during  the  year  san^led  exceeded 
the  standard;  and  with  what  probability  will  values  exceeding  the  stan- 
dard be  realized  in  sixbsequent  years?  With  respect  to  short-term  stan- 
dards, the  answer  to  the  first  question  is  foxind  from  the  cumulative 
distribution.   For  example,  the  highest  and  second  highest  values  in 
365  possible  samples  are  approximately  .1644%  and  .4384%.  Presumably, 
if  the  concentration  which  corresponds  to  .44%  is  less  than  the  24-hour 
standard,  no  two  days  in  the  year  sampled  exceeded  this  standard. 
The  second  question  can  be  answered  by  evaluating  the  order  statistic 
of  the  highest  and  second  highest  observations  as  discussed  in  David 
(  8 ) ,  The  rth  order  statistic  takes  the  form 

n 
Fr,(x)  =  Z        (j)  [FCx)^]   [1  -  F{x)P"^  A1.35 

j=r 

where 

Fj.{x)  =  Gvunulative  distribution  of  the  rth  order  statistic 

r  =  rank  order  of  observations  (r=l  the  lowest  value, 
r  -  n  the  highest) 

n  =  niunber  of  observations  in  a  sample 

(j)  =  n!/(j!  (n-j)!)  . 
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The  distribution  for  the  highest  day  in  365  can  easily  be  computed 

since  F3g5  (x)  =  [F(x)]^^^.  When  ^  c(x)  =  .50,  the  value  of  x  is  the 
expected  value  of  the  highest  in  a  sample  of  365  observations.  The 
90%  confidence  interval  can  be  constructed  by  finding  those  values 
of  X  for  which  ^355  (x)  is  .95  and  .05.  These  computations  cire  illus- 
trated in  Figure  AI.l.  A  graphical  procedure  must  be  used  since  the 
lognormal  CDF   cannot  be  evaluated  exactly.  If  365  values  were  sampled 
from  a  lognormal  population  with  ra_  =  100  aind  CTg  =  2,0,  90%  of  these 
sets  of  365  values  would  contain  a  maximum  value  between  540  and  1260 
Mg/m  .  The  maximum  would  tend  to  be  767,  but  occasionally  the  highest 
value  will  exceed  1260  vig/m^.  The  second,  third  and  other  order  sta- 
tistics provide  additional  interpretation  when  their  joint  probabilities 
are  con^JUted.  Unfortunately,  these  computations  become  veiry  tedious 
for  the  lognormal  family.  The  closed  form  kappa  provides  an  easy  method 
of  further  investigating  the  probability  of  exceeding  standards  in 
subsequent  years,  including  the  median  order  statistic.  However,  the 
usefulness  of  this  technique  is  limited  by  the  non-stationary  character 
of  =iir  pollution  distributions. 

The  chemilxominescent  monitors  which  generated  the  data  used  in  this 
study  were  calibrated  in  three  steps.  A  spectrophotometer  curve  of 
absorbance  versus  ozone  concentration,  as  iodine,  was  first  constructed. 
A  portable  ozone  source  was  then  calibrated  to  determine  the  absorbance 
of  a  known  volume  of  gas  versus  a  slide  setting  on  the  ozone  generator. 
In  the  field,  each  monitor  was  then  calibrated  by  determining  the  chart 
deflection  which  resulted  at  each  slide  setting.  As  discussed  in 
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Snedecor  and  Cochran, (42p.l59)  prediction  of  a  dependent  variable  from 

an  independent  variable  differs  from  prediction  of  an  independent 

variable  from  a  dependent  variable.  Taking  this  difference  into 

account,  three  linear  calibration  cvirves  and  a  total  calibration  can 

be  foinnulated  as 

y  =  m]_X  +  bi  -  spectrophotometer 

y  =  m2X*  +  b2  -  ozone  source 


AI.36 


y*-  m3X*  +  b3  -  field  monitor 

m2                                     b2-bi  in2b3 

X  =  r~;r"    (y*-offset)  +  — —  -   -r-r- 

m2^m3       ■*                              m}_  nix^3 


where 

y  =  absorbance 

y*=  field  monitor  chart  deflection 
X  =  ozone  concentration 
X*=  slide  setting  on  ozone  source 
irij  =  slope  of  calibration  line 
bj[  =  intercept  of  calibration  line 
offset=  chart  deflection  at  zero  concentration. 

Figure  1.4  illustrates  the  three  stage  calibration  system.   A  calibra- 
tion system.  A  50S  confidence  bound  was  applied  to  each  independent 
calibration,  since  exactly  meeting  this  bound  three  successive  times 
represents  an  overall  confidence  interval  of  approximately  90% 
(1.0-(.5)  ■^) .  Thus  a  chart  deflection  of  20%  in  the  example  calibration 
yields  an  overall  90%  confidence  interval  of  90J:  7  parts  per  billion 
as  shown  in  Table  I.   This  is  considered  to  be  very  accurate  and  supports 
the  validity  of  reported  ozone  data  in  Hillsborough  County. 


FIGURE  AI.4 
Example  Three  Stage  Calibration- 
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TABLE  AI.2 
Confidence  Limits  for  Ambient  Ozone  Calibration 

Ambient  Station       -  Davis  Islands 

Date  of  Calibration   -  June  28,  1974 

Period  of  Application  -  June  28  to  August  8,  1974 

Field  Calibration     -  Y*=(-0.4281) X*  +  73.8810 

Ozone  Calibration     -  Y=(-0.1129)X*  +  19.1449 

Spect  Calibration     -  Y=(  0.0523)X  +  -0.0066 

The  50  percent  confidence  interval  on  each  curve  isj 


20.00 
126.94  129.47 

5.01  4.62  4.73  4.33 

95.94    95.91    88.40    88.37    90.52    90.49    82.92    82.89 

Overall  Calibration  Curve  -  X=(5.0372) (Y*-Offset)  +  -6.26 
Overall  90  Percent  Confidence  Interval  is: 
89.5  ±6.5  parts  per  billion 


Ph^';:t'^/ 


APPENDIX  II 
PRINCIPAL  COMPONENTS 
Orthogonal  principal  components  plan  an  important  role  in  this  disser- 
tation. In  the  ozone  application,  each  case  is  reduced  from  24  inter- 
correlated  hourly  values,  to  six  factor  scores  which  represent  indepen- 
dent weighting  functions.  With  proper  scaling,  this  technique  allows 
the  use  of  the  euclidean  metric  to  achieve  a  Mahalanobis  measure 
of  distance  between  cases.  Appendix  II  briefly  describes  some  matrix 
considerations  and  presents  a  three  dimensional  example. 

A  matrix  is  a  convenient  means  of  representing  mathematical  information. 

In  many  applications,  it  provides  a  systematic  method  of  extending 

one  dimensional  results  to  multiple  dimensions.  Virtually  all  matrices 

Ccui  be  represented  as  polynomials  (characteristic  equation) ,  but 

complications  arise  where  a  generalized  matrix  is  considered.  A 

real  valued  matrix  with  three  rows  and  three  columns  is  equivalent 

to  a  third  order  polynomial.  For  example,  the  matrix 

2-10 

[a]  =  -1     2-1  AII.l 

0-12 

has  an  equivalent  form  in  the  polynomial 

f(X)  =  \^   -6X2  +  loA  -4.  All. 2 

Algebra  and  elementary  calculus  can  be  used  to  determine  the  three 

roots  of  this  equation  and  the  relative  maximum  and  minimian  points. 

The  three  roots  of  [aJ  are  2  and  it.  J 2.     These  are  the  eigenvalues 

of  [  a]  .  A  number  of  methods  have  been  developed  to  solve  for  these 

eigenvalues,  given  a  matrix.  Real  symmetric  matrices  are  particularly 

111 
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easy  to  solve  and  con5>uter  algorithms  for  these  are  readily  available. 

It  is  also  a  well  known  fact  that  an  nxs  dimensional  real  symmetric 

matrix  always  has  n  independent  real  eigenvalues.  Thus  the  existance 

of  eigenvalues  and  associated  eigenvectors  in  the  application  presented 

in  this  dissertation  is  guaranteed,  since  the  covariance  matrix  has 

real  symmetric  fonn. 

Eigen  analysis  deals  with  the  solution  of  singular  linear  systems 
of  equations.  The  matrix  A  is  called  the  coeficient  matrix  of  the 
system.   If  the  equation 

[a][e]=  0  All. 3 

has  solutions  other  than[E]=  0,  then  the  equation 

[a][e]=  c  All. 4- 

has  either  an  infinite  number  of  solutions  or  no  solution  at  all.  This 
is  a  singular  system.   The  infinite  solutions  arise  since  any  arbitraiy 
constant  multiplied  by  the  eigenvector [e] is  also  a  solution.  When 
the  set  of  eigenvectors  associated  with  each  eigenvalue  is  considered 
collectively,  the  matrix  [e]  is  known  as  the  modal  matrix.   [e1  is 
an  orthogonal  matrix  with  the  independent  eigenvectors  as  columns. 
Consequently, 

[EHEl'^=  [I]  AII.5 

[E]"^  =  [E]"^  All. 6 

and  the  inverse  is  guaranteed  to  exist.  A  unique  solution  to  the  equa- 
tion 

[E]'^[A][E]  =   A[I]  AII.7 

can  be  specified  by  requiring  the  sum  of  the  eigenvalues  (A^)  to  equal 
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the  sum  of  the  diagonal  elements  of  matrix[A.]  Then  the  eigenvectors 

of  [e]  are  called  the  orthogonal  principal  components  of  the  coeficient 

matrix  [A] .  Equation  AIt.7  is  often  written 

[a][e]  =  [eI  [i]  All. 8 

by  virtue  of  the  equivalence  of  the  inverse  and  transposed  model  matrix. 

For  the  example  given  above,  the  model  matrix  is 

111 

[e]  =     -JT    0    rr  All. 9 

1-11 

and  the  columns  are  eigenvectors  corresponding  to  the  eigenvalues 
2  +  2,  2,  and  2-  2  respectively.  To  illustrate  equation  A2.8,  we 
have 


2-10 

-1       2       -1 

0-1  2 


111 

-j2    0   yr 

1-1     1 


111 
-yr   0   yr 


1  -1 


2+y2'        0  0 

0  2  0    fen. 10 

0  0      2-^2* 


performing  the  indicated  multiplication  (row  by  column)  we  obtain 


2+yr     2 

2-/r 

2+J2 

2 

2-JT 

2-2/2"         0 

-2+2/2" 

= 

2-2/1 

0 

-2+2/2" 

2+yr    -2 

2-J2 

2+J2 

-2 

2-yr 

All. 11 
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ABSTRACT  115 

Days  with  high  ozone  concentrations  occvirred  in  April  and  May,  after 
the  passage  of  a  cold  front.  During  these  periods,  Tampa  was  under 
the  influence  of  a  high  pressure  system  with  clear  skies,  mild  tempera- 
tures, and  less  than  normal  relative  humidity  and  wind  speed.  Wind 
direction  did  not  appear  to  have  a  primary  influence  on  ozone  concen- 
tration; but,  some  evidence  exists  showing  a  possible  link  between 
stratospheric  and  surface  concentration.  Days  with  low  to  moderate 
ozone  concentrations  were  days  with  reduced  surface  insolation,  due 
to  either  significant  cloud  development  or  low  solar  elevation.  From 
the  data  cxirrently  available,  it  was  not  possible  to  delineate  meteoro- 
logical differences  between  episode  and  non-episode  days.   However, 
super  episode  days  appear  to  be  associated  with  passing  cold  fronts, 
high  pressure  systems  and  strong  solar  insolation. 
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Meteorological  Characteristics  of  Days  Grouped 
on  the  Basis  of  Surface  Ozone  Concentration 
as  Observed  in  Tampa,  Florida 


A  preliminary  meteorological  study  was  conducted  on  selected  days  in 
1974  based  on  principal  component  and  cluster  analysis  results  for 
Tampa,  Florida.  Days  from  groups  denoted  super  episode,  episode  and 
non-episode  were  selected  based  on  temporal  stationarity  and  spatial 
homogeneity.  These  days  were  selected  so  that  microscale  meteorologi- 
cal effects  would  be  minimized  and  the  extremely  complicated  details 
of  the  Tan^ja  Bay  meteorology  could  be  ignored.  The  objective  of  the 
study  was  a  determination  of  synoptic  eind  mesoscale  patterns  which 
could  be  used  to  characterize  the  three  group  types.  Contrast  of  mete- 
orological conditions  on  different  type  days  identified  possible  influ- 
ence parameters.  However,  the  standard  meteorological  data  ciarrently 
available  are  not  sufficient  to  explain  many  of  the  observed  differences 
in  surface  ozone  concentration. 

Data  for  this  study  were  obtained  from  a  variety  of  sources.  The 
synoptic  pattern  for  each  day  was  determined  from  the  Daily  Weather 
Maps,  Weekly  Series  published  by  the  National  Oceanographic  and  Atmos- 
pheric Administration  (NCAA) .   These  publications  contain  daily  surface 
and  500  mb.  weather  maps,  valid  at  7:00  A.M.  Eastern  Standard  Time. 
Areas  and  amounts  of  total  precipitation  are  also  given.  Wind,  tempera- 
ture, sunshine,  sky  cover,  and  weather  type  data  were  obtained  from 
the  Local  Climatological  Data  summaries  for  Tampa.  Hourly  wind  and 
stability  data  were  obtained  from  a  computer  model  run  previously  for 
the  Tampa  Bay  area.  Where  required,  data  from  the  Hourly  Precipitation 
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Data  summaries  were  also  iised.  Finally,  data  from  the  Dobson  spectro- 
photometer located  in  Tallahassee  were  used  to  obtain  a  general  idea 
of  the  total  ozone  in  the  atmosphere. 

From  a  meteorological  viewpoint,  super  episode  days  were  distinctly 
different  from  the  other  days  studied.  The  episode  and  non-episode 
days  differed  more  in  degree  from  one  another  than  in  stibstance. 
Also,  the  meteorological  parameters  defining  a  super  episode  day  were 
more  consistent  than  those  defining  non-super  episode  days.  There 
appears  to  be  a  unique  type  of  meteorology  associated  with  super  episode 
days,  but  there  were  several  types  of  meteorology  associated  with  the 
other  types  of  days. 

The  most  obvious  chairacteristic  of  the  super  episode  days  was  the 
time  of  year  of  their  occurrence.  This  type  of  day  occurred  primarily 
from  April  through  July.  Prior  to  April,  the  insolation  was  insuffi- 
cient (due  to  the  low  solar  elevation)  to  produce  high  ozone  concen- 
trations. After  May,  synoptic  systems  were  generally  too  weeik  to  domi- 
nate the  local  circulation  systems.   The  high  pressure  systems,  in 
particular,  were  too  weak  to  suppress  diurnal  cloud  development. 

Super  episode  days  typically  began  12  to  36  hours  after  the  passage 
of  a  cold  front.  There  were  frontal  passages  on  April  16,  April  24, 
and  May  7.   The  super  episode  days  occurred  while  under  the  influence 
of  the  high  pressure  system  which  usually  follows  a  cold  front-   These 
were  days  of  clear  skies,  mild  temperatures,  and  lower  than  normal 
relative  humidity.  The  wind  speeds  were,  on  the  average,  less  than 
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normal.  Ozone  concentration  increased  with  increasing  wind  speed. 

(This  was  expected  since  both  parameters  increase  with  increased  solar 
insolation) .  The  wind  direction  varied  from  westerly  to  northerly 
to  easterly,  with  no  consistent,  descernible  correlation  to  ozone  con- 
centration. On  many  super  episode  days,  there  was  a  week,  but  definite 
sea  breeze.  Again,  however,  there  was  no  significant  recognizable 
effect  on  ozone  concentration.  While  there  was  ordinarily  no  rain 
measured  on  super  episode  days,  rain  was  often  associated  with  the 
preliminary  frontal  passages. 

Based  on  a  superficial  qualitative  investigation  of  the  Dobson  spec- 
trophotometer data,  it  appears  that  a  relationship  may  exist  between 
the  total  ozone  in  an  atmospheric  column  and  the  surface  ozone  concen- 
tration. A  seasonal  correlation  is  expected,  since  both  quantities 
are  believed  to  be  dependent  on  the  amoiant  of  insolation.  However, 
the  existence  of  a  daily  correlation  would  be  unexpected,  as  approxi- 
mately 90%  of  the  total  ozone  in  an  atmospheric  column  is  contained 
in  the  stratosphere.   If  such  a  correlation  were  found,  it  could  indi- 
cate a  link  between  surface  and  stratospheric  ozone. 

There  was  one  characteristic  found  to  be  common  on  non-super  episode 
days.  For  a  variety  of  reasons,  these  days  had  insufficient  insolation 
to  generate  abundant  ozone  during  most  of  the  daylight  hours.   For 
example,  March  episode  days  were  meteorologically  very  similar  to  the 
super  episode  days  of  April  and  May.  During  one  March  episode,  the 
Tampa  Bay  area  was  under  the  influence  of  a  high  pressiure  system  in 
which  there  was  little  cloud  cover  and  light  winds.   The  difference 
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between  these  days  and  the  super  episode  days  appears  to  have  been 

primarily  the  time  of  year.  Prior  to  April,  the  solar  elevation  was 
generally  too  small  to  produce  the  required  surface  insolation,  regard- 
less of  the  amount  of  cloud  cover. 

For  two  of  the  non-episode  day  groups,  the  factor  limiting  insolation 
was  the  cloud  cover  associated  with  approaching  frontal  activity. 
A  non-episode  group  in  November  was  followed  by  the  passage  of  a  cold 
front  on  November  21.   Similarly,  a  non-episode  group  in  December  was 
followed  by  a  frontal  passage  on  December  16.  With  few  exceptions, 
the  November  and  December  non-episode  days  were  mostly  cloudy  to  com- 
pletely overcast.  On  the  average,  they  received  less  than  5  hours 
of  sunshine  per  day.  This  amount  of  cloud  cover,  coupled  with  the 
time  of  year  and  low  solar  elevation  significantly  reduced  the  inten- 
sity of  the  surface  insolation. 

Extensive  cloud  cover  was  also  present  during  the  September  2  through 
September  5  non-episode  days.   While  the  presence  of  cloud  cover  was 
influenced  by  a  weak  cold  front  north  of  Tampa,  it  was  more  directly 
caused  by  the  development  and  movement  of  Hurrican  Carmen.   Clouds 
associated  with  hurricans  are  deep  cumuliform  clouds  which  have  greater 
albedos  and  greater  optical  thicknesses  than  those  associated  with 
shallow  cold  fronts.  Thus,  even  with  the  high  solar  elevation,  the 
solar  radiation  reaching  the  stir face  would  have  suffered  significant 
attenuation . 
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In  the  summer  months,  all  three  types  of  days  were  common.  During 
this  season,  the  synoptic  weather  pattern  becomes  a  secondary  rather 
than  a  primary  influence  on  weather  in  South  Florida.   These  were 
warm  days  with  generally  well  developed  sea  breeze  circulations.   The 
most  common  denominator  for  summer  days,  however,  was  the  presence 
of  thunderstorms  in  the  area.  The  weather  observer  was  able  to  hear 
thunder  from  the  station  at  some  time  during  most  of  these  days. 
On  days  where  the  hour  of  the  thunderstorm  was  known,  a  marked  decrease 
in  ozone  concentration  often  coincided  with  the  storm.   There  was  no 
obvious  difference  between  the  episode  day  meteorology  and  non-episode 
day  meteorology.  This  does  not  mean  that  siabtle  differences  (in,  for 
example,  cloud  height,  thunderstorm  electricity,  sea  breeze  character- 
istics, etc.)  did  not  exist;  it  means  only  that  the  standard  meteoro- 
logical data  reported  was  not  sufficient  to  delineate  these  differences. 
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